PANDA: Prototypical Unsupervised Domain Adaptation

by   Dapeng Hu, et al.
National University of Singapore

Previous adversarial domain alignment methods for unsupervised domain adaptation (UDA) pursue conditional domain alignment via intermediate pseudo labels. However, these pseudo labels are generated from independent instances without considering the global data structure and tend to be noisy, making them unreliable for adversarial domain adaptation. Compared with pseudo labels, prototypes represent the data structure more reliably and are more resistant to the domain shift, since they are summarized over all the relevant instances. In this work, we attempt to calibrate the noisy pseudo labels with prototypes. Specifically, we first obtain a reliable prototypical representation for each instance by multiplying the soft instance predictions with the global prototypes. Based on the prototypical representation, we propose a novel Prototypical Adversarial Learning (PAL) scheme and exploit it to align both feature representations and intermediate prototypes across domains. Besides, with the intermediate prototypes as a proxy, we further minimize the intra-class variance in the target domain to adaptively improve the pseudo labels. Integrating the three objectives, we develop a unified framework termed PrototypicAl uNsupervised Domain Adaptation (PANDA) for UDA. Experiments show that PANDA achieves state-of-the-art or competitive results on multiple UDA benchmarks including both object recognition and semantic segmentation tasks.


1 Introduction

Unsupervised domain adaptation (UDA) aims to leverage the knowledge of a labeled data set (source domain) to help train a predictive model for another unlabeled data set (target domain). Deep UDA methods exploit supervisions from heterogeneous sources and bring noticeable performance gain to many tasks, including image classification [long2015learning, saito2017asymmetric] and semantic segmentation [hoffman2016fcns, tsai2018learning, vu2019dada]. Previous methods exploit maximum mean discrepancy (MMD) [gretton2008kernel, long2015learning] or other distribution statistics like central moments [sun2016deep, zellinger2017central, koniusz2017domain] for domain adaptation. Recently, adversarial learning [goodfellow2014generative] provides a promising alternative solution to the UDA problem.

Since the labels of the target instances are not provided in UDA, the adversarial learning scheme for adaptation [ganin2015unsupervised] suffers from cross-domain misalignment, where target instances from a class A may be misaligned with source instances from a different class B. Inspired by the pseudo-labeling strategy from semi-supervised learning, previous methods use the pseudo labels in the target domain to either perform joint distribution discrepancy minimization [long2013transfer, long2015learning] or develop conditional adversarial domain alignment that involves one high-dimensional domain discriminator [long2018conditional] or multiple class-wise domain discriminators [chen2017no, pei2018multi]. Though effective, these conditional domain adversarial learning methods align the instances from different domains relying only on their own pseudo labels. Output by the classifier trained with source instances, such pseudo labels of the target instances tend to be noisy due to the domain shift. A toy example is given in Fig. 1(a), where the pseudo label of the chosen target instance is inclined to be class ‘square’ while the ground truth label is class ‘circle’. Guided only by the instance-level prediction, adversarial domain adaptation methods would push this target instance of class ‘circle’ towards source instances of class ‘square’, leading to misalignment across domains.

(a) conditional adversarial learning
(b) prototypical adversarial learning
Figure 1: Illustration of two adversarial domain adaptation schemes. Different from class-agnostic adversarial learning that pursues the marginal distribution alignment but ignores the semantic consistency across domains, (a) previous conditional adversarial learning methods heavily rely on the instance-level pseudo labels to perform domain alignment, while (b) prototypical adversarial learning integrates the instance-level pseudo labels with prototypes to make the conditional indicators more reliable. Class information is denoted in different shapes with source in solid and target in hollow. Dashed lines are the classification boundaries.

To alleviate the misalignment induced by these noisy pseudo labels, we resort to the reliable prototypes. Specifically, we summarize the prototypes from all instances according to their predictions. In this way, we lessen the damage brought by instances with inaccurate pseudo labels, and encourage the instances with larger certainty to contribute more to the corresponding prototype. The derived prototypes enjoy two advantages over the pseudo labels. On the one hand, prototypes are able to reliably represent the global data structure which is especially ignored by target instance predictions. As shown in Fig. 1(b), the semantic correlation clue that class ‘circle’ is closer to class ‘triangle’ than to class ‘square’ can be reliably described by the similarity relationship among the prototypes. On the other hand, all instances contributing to the derived prototypes are implicitly associated with each other, while pseudo labels are separately relevant to their corresponding instances.

Motivated by these observations, we first propose a Prototypical Adversarial Learning (PAL) scheme which complements instance predictions with global prototypes to obtain more reliable conditional information for adversarial domain adaptation. To be specific, we obtain the reliable global prototypes from the intermediate prototypes during training via a momentum update. By multiplying the soft instance predictions with such global prototypes, we obtain a reliable prototypical representation for each instance. Through the multiplication, the semantic information within the noisy pseudo labels can be refined by the prototypes. We then concatenate this prototypical representation with the original feature representation for the domain adversarial learning.

The PAL scheme refines the semantic information of noisy pseudo labels with global prototypes but does not improve the pseudo labels themselves directly. Thus we further introduce a simple objective to adaptively strengthen the pseudo labels. Specifically, taking the target intermediate prototypes as a proxy, we encourage the intra-class compactness in the target domain to implicitly enhance the pseudo labels. Finally, we develop a PrototypicAl uNsupervised Domain Adaptation (PANDA) framework that promotes the intra-class compactness and aligns both the instance feature representations and the intermediate prototypes through the proposed PAL scheme. Experimental results on both object recognition and semantic segmentation tasks clearly demonstrate the advantages of our approaches over previous state-of-the-art methods [long2018conditional, xu2019unsupervised, luo2019taking, tsai2019domain].

The contributions of this work are threefold:

1) As far as we know, it is the first work to leverage the prototypes in domain adversarial learning to alleviate the misalignment induced by noisy pseudo labels;

2) We propose a novel and integrated framework PANDA to calibrate the pseudo labels for unsupervised domain adaptation;

3) The proposed PANDA framework is generic and achieves state-of-the-art or competitive results on two typical transfer tasks, i.e., cross-domain object recognition and synthetic-to-real semantic segmentation.

2 Related Work

Unsupervised Domain Adaptation. UDA was first modeled as the covariate shift problem [shimodaira2000improving], where the marginal distributions of different domains differ but their conditional distributions are the same. To address it, [dudik2006correcting, huang2007correcting] exploit a non-parametric instance re-weighting scheme. Another prevailing paradigm [pan2010domain, long2013transfer, herath2017learning] aims to learn a feature transformation with popular cross-domain metrics, e.g., the empirical maximum mean discrepancy (MMD) statistics. Recently, a large number of deep UDA works [long2015learning, haeusser2017associative, saito2018maximum, tsai2018learning] have been developed and have boosted the performance of various vision tasks. Generally, they can be divided into discrepancy-based and adversarial-based methods. Discrepancy-based methods [tzeng2014deep, long2017deep] address the dataset shift by mitigating specific discrepancies defined on different layers of a shared model between domains, e.g., resembling shallow feature transformation by matching higher moment statistics of features from different domains [zellinger2017central, koniusz2017domain]. Recently, adversarial learning has become a dominant solution to domain adaptation problems. It leverages an extra domain discriminator to promote domain confusion. [ganin2015unsupervised] designs a gradient reversal layer inside the classification network and [tzeng2017adversarial] utilizes an inverted label GAN loss to fool the discriminator.

Pseudo-labeling. UDA can be regarded as a semi-supervised learning (SSL) task where unlabeled data are replaced by the target instances. Therefore, some popular SSL strategies, e.g., entropy minimization [grandvalet2005semi, vu2019advent], mean-teacher [tarvainen2017mean, french2017self] and virtual adversarial training [miyato2018virtual, shu2018dirt], have been successfully applied to UDA. Pseudo-labeling is favored by most UDA methods due to its convenience. For example, [saito2017asymmetric, li2019bidirectional] exploit the intermediate pseudo-labels with tri-training and self-training, respectively. Recently, curriculum learning [choi2019pseudo], self-paced learning [zou2018unsupervised] and re-weighting schemes [long2018conditional] are further leveraged to tackle possible false pseudo-labels.

Conditional Domain Adaptation. Apart from the explicit integration with the last classifier layer, pseudo-labels can also be incorporated into adversarial learning to enhance the feature-level domain alignment. Concerning shallow methods [long2013transfer, zhang2017joint], pseudo-labels can help mitigate the joint distribution discrepancy via minimizing multiple class-wise MMD measures. [long2017deep] proposes to align the joint distributions of multiple domain-specific layers across domains based on a joint maximum mean discrepancy criterion. Recently, [chen2017no, pei2018multi] leverage the probabilities with multiple domain discriminators to enable fine-grained alignment of different data distributions in an end-to-end manner. In contrast, [long2018conditional] conditions the adversarial domain adaptation on discriminative information via the outer product of feature representation and classifier prediction. Motivated by the semantically-consistent GAN, [cicek2019unsupervised] imposes a multi-way adversarial loss instead of a binary one on the domain alignment.

However, all these methods rely heavily on the instance-level pseudo labels to align label-conditional feature distributions, which is risky because the pseudo labels are noisy. This work proposes to exploit reliable prototypes to calibrate the noisy pseudo labels and guide the domain adversarial learning. Specifically, it complements the original feature representations with reliable prototype-based semantic features and involves only two low-dimensional domain discriminators, making the domain alignment process simple, conditional and reliable. In fact, some previous works [pinheiro2018unsupervised, pan2019transferrable, chen2019progressive, xie2018learning] have utilized prototypes for solving the UDA problem. Unlike our work, [pinheiro2018unsupervised, pan2019transferrable] exploit prototypes as the classifier and [pan2019transferrable, chen2019progressive, xie2018learning] aim to minimize MMD measures between prototypes from different domains.

3 Method

In this section, we begin with the basic settings of UDA and then give detailed descriptions of the proposed PAL scheme and the PANDA framework. Though introduced in the context of the image classification task, they can also be readily applied to semantic segmentation.

3.1 Problem Settings

In a vanilla UDA task, we are given label-rich source domain data sampled from the source joint distribution and unlabeled target domain data sampled from the target joint distribution, where each source sample consists of an image and its corresponding label, each target sample is an image only, and the two distributions differ. The goal of UDA is to learn a discriminative model from the labeled source data and the unlabeled target data to predict labels for the unlabeled target samples.


Figure 2: Overview of the proposed PANDA framework, which consists of a shared feature extractor $G$, a shared classifier $F$ and two domain discriminators. The framework maintains a global prototype matrix, while the intermediate prototypes are computed from the source or target instances in the batch.

As described in [ganin2016domain], a vanilla domain adversarial learning framework consists of a feature extractor network $G$, a classifier network $F$ and a discriminator network $D$. Given an image $x$, we denote the feature representation vector extracted by $G$ as $f = G(x) \in \mathbb{R}^{d}$ and the probability prediction obtained by $F$ as $p = F(f) \in \mathbb{R}^{c}$, where $d$ is the feature dimension and $c$ is the number of classes. The vanilla domain adversarial learning method in [ganin2016domain] can be formulated as a minimax optimization problem in which the binary classifier $D$ predicts the domain assignment probability over the input features, the classification term is the cross-entropy loss on source domain data, and a trade-off parameter balances the classification and adversarial terms.
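For reference, the minimax problem described above is commonly written in the following standard form. This is a sketch in assumed notation: the loss names $\mathcal{L}_{cls}$ and $\mathcal{L}_{adv}$, the trade-off symbol $\lambda$, the cross-entropy symbol $\ell_{ce}$ and the subscripts $s$ and $t$ for source and target are labels chosen here for illustration, and the exact expression used in the paper may differ.

```latex
\begin{align}
&\min_{G,F}\;\max_{D}\;\; \mathcal{L}_{cls}(G,F) + \lambda\,\mathcal{L}_{adv}(G,D), \\
&\mathcal{L}_{adv}(G,D) = \mathbb{E}_{x_s}\,\log D\big(G(x_s)\big)
   + \mathbb{E}_{x_t}\,\log\big(1 - D(G(x_t))\big), \\
&\mathcal{L}_{cls}(G,F) = \mathbb{E}_{(x_s,y_s)}\,\ell_{ce}\big(F(G(x_s)),\,y_s\big).
\end{align}
```

Under this sign convention the discriminator maximizes its domain classification log-likelihood, while the feature extractor minimizes it and thereby promotes domain confusion.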

3.2 Prototypical Adversarial Learning (PAL) Scheme

The misalignment of multi-class distributions in UDA challenges the popular vanilla domain adversarial learning methods. In previous works [long2017deep, pei2018multi, long2018conditional], target domain data are conditioned only on the corresponding pseudo labels predicted by the classifier for adversarial domain alignment. The general optimization process of these methods is the same as the aforementioned vanilla domain adversarial learning framework, except that the discriminator considers feature representations jointly with predictions, which yields a conditional adversarial loss that leverages the classification probability predictions of source and target instances.

A classic previous work [long2018conditional] implicitly conditions the feature representation $f$ on the prediction $p$ through the outer product $f \otimes p$, and uses one shared discriminator to align the conditioned feature representations. [long2018conditional] further shows that using the outer product can perform much better than simple concatenation, i.e., $[f, p]$. Different from [long2018conditional], [chen2017no, pei2018multi] explicitly utilize multiple class-wise domain discriminators to align the feature representations relying on the pseudo labels of the corresponding instances.
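The difference between the two conditioning strategies mentioned above can be illustrated with a minimal pure-Python sketch. The helper names `outer_product` and `concatenate` and the toy vectors are hypothetical and chosen only for illustration; this is not the implementation of [long2018conditional].

```python
def outer_product(feature, prediction):
    """Flattened outer product: every feature entry scaled by every class probability."""
    return [f_j * p_k for f_j in feature for p_k in prediction]


def concatenate(feature, prediction):
    """Simple concatenation of the feature vector and the prediction vector."""
    return list(feature) + list(prediction)


# Toy example (hypothetical values): feature dimension 3, two classes.
f = [0.5, -1.0, 2.0]
p = [0.7, 0.3]

joint = outer_product(f, p)   # length = 3 * 2, captures feature-class interactions
concat = concatenate(f, p)    # length = 3 + 2, no multiplicative interaction
```

The outer product grows with the product of the two dimensions, which is why a single discriminator on it is described as high-dimensional.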

However, the pseudo labels or target instance predictions output by the classifier may be noisy due to the domain shift. Therefore, conditioning the alignment only on pseudo labels cannot safely mitigate the misalignment. Compared with the pseudo labels, the prototypes are more robust and reliable in terms of representing the global data structure [yang2018robust].

Specifically, we summarize the prototypes over all instances according to the corresponding instance predictions. Using predictions as weights adaptively controls the contributions of typical and non-typical instances to each class prototype, reducing the damage caused by instances with inaccurate pseudo labels. In practice, we first aggregate the feature representation of each instance, weighted by its prediction, to generate the batch-level prototypes, i.e., intermediate prototypes. The reliable global prototypes are then obtained from the batch-level ones via a momentum update strategy such as an exponential moving average (EMA), as formulated in Eqs. (5) and (6).

In this formulation, every instance in a batch contributes to each class prototype in proportion to its predicted probability of belonging to that semantic class, an empirical weight controls the momentum update, and the reliable global prototypes are computed from source domain instances only.
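The two steps described above, prediction-weighted batch prototypes followed by a momentum update, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, each batch prototype is assumed to be normalized by the total prediction weight of its class, and the update is assumed to take the usual exponential-moving-average form; the paper's exact Eqs. (5) and (6) may differ in these details.

```python
def batch_prototypes(features, predictions):
    """Prediction-weighted class prototypes for one batch.

    features:    B vectors of length d
    predictions: B soft predictions of length c
    Returns c prototypes of length d.
    ASSUMPTION: each prototype is divided by the total weight of its class.
    """
    num_classes = len(predictions[0])
    dim = len(features[0])
    prototypes = []
    for k in range(num_classes):
        weight_sum = max(sum(p[k] for p in predictions), 1e-8)
        prototypes.append([
            sum(p[k] * f[j] for f, p in zip(features, predictions)) / weight_sum
            for j in range(dim)
        ])
    return prototypes


def ema_update(global_protos, batch_protos, momentum=0.5):
    """ASSUMPTION: global <- momentum * global + (1 - momentum) * batch."""
    return [
        [momentum * g + (1.0 - momentum) * b for g, b in zip(g_row, b_row)]
        for g_row, b_row in zip(global_protos, batch_protos)
    ]


# Toy source batch (hypothetical values): three instances, d = 2, c = 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
predictions = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

batch = batch_prototypes(features, predictions)
global_protos = ema_update([[0.0, 0.0], [0.0, 0.0]], batch)
```

Confident instances dominate their class prototype, while an ambiguous instance such as the third one is split evenly between the two classes.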

To acquire more reliable conditional information for domain adversarial learning, we propose to complement instance predictions with the reliable global prototypes and reformulate the adversarial loss accordingly, so that our prototypical adversarial learning loss involves the global class prototype matrix. As shown in Eq. (7), for each independent instance, the reliable prototypical representation is obtained by multiplying the global prototypes with its prediction. Thus the semantic information of noisy pseudo labels can be refined by the reliable global prototypes. Similarly, for each instance in the batch, we multiply the batch-level prototypes with its prediction to obtain the intermediate prototypical representation.
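The multiplication of prototypes with a soft prediction can be sketched as below. The sketch assumes that the prototypes are stored as one vector per class, so the product is a prediction-weighted mixture of class prototypes, and that the discriminator input is the concatenation of the feature and this mixture, as stated in the introduction. The function names and toy values are hypothetical.

```python
def prototypical_representation(prototypes, prediction):
    """Prediction-weighted mixture of class prototypes (a vector of length d)."""
    dim = len(prototypes[0])
    return [
        sum(p_k * proto[j] for p_k, proto in zip(prediction, prototypes))
        for j in range(dim)
    ]


def discriminator_input(feature, prototypes, prediction):
    """Feature concatenated with its prototypical representation."""
    return list(feature) + prototypical_representation(prototypes, prediction)


# Toy example (hypothetical values): three classes, feature dimension 2.
global_protos = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
noisy_pred = [0.4, 0.35, 0.25]

rep = prototypical_representation(global_protos, noisy_pred)
disc_in = discriminator_input([2.0, 3.0], global_protos, noisy_pred)
```

Because the mixture lives in the feature space, it carries the similarity structure among class prototypes that a raw probability vector does not.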

3.3 Prototypical Unsupervised Domain Adaptation (PANDA) Framework

Based on our PAL scheme as well as the prototypical representation, we build a PrototypicAl uNsupervised Domain Adaptation (PANDA) framework. This framework aligns both instance-level and prototype-level feature representations through PAL and promotes the intra-class compactness in the target domain. Along these lines, the misalignment induced by noisy pseudo labels can be further alleviated even though no supervision is available in the target domain. The overall architecture of PANDA is shown in Fig. 2.

Besides the backbone feature extractor and the task classifier, there are two discriminators in our PANDA framework, i.e., the instance-level feature discriminator and the prototype-level feature discriminator. Our general objective function combines four losses with balancing factors among them: the supervised classification loss on source domain data described by Eq. (3), the adversarial loss that aligns instance feature representations across domains, the adversarial loss that aligns the intermediate prototypes across domains, and the loss that promotes intra-class compactness in the target domain.

Instance-Level Alignment. Conditioning the instance feature representation on our reliable prototypical representation, we align feature representations across domains at the instance level through the instance-level discriminator. With this reliable conditional information, the misalignment in adversarial domain adaptation can be alleviated. The instance-level adversarial loss is defined on the feature representation conditioned on this prototypical representation.

Algorithm 1: How does our PANDA work?
Input: feature extractor G, task classifier F, the instance-level and prototype-level domain discriminators, source data, target data;
Parameters: total number of training iterations, batch size, loss weights, and the momentum weight;
Randomly initialize the global prototypes;
for each training iteration do
       Sample a source batch and a target batch;
       Obtain the source and target features through G, then obtain the source and target predictions through F;
       Obtain the intermediate source prototypes from the source features and predictions by Eq. (5);
       Obtain the intermediate target prototypes from the target features and predictions by Eq. (5);
       Update the global prototypes with the intermediate source prototypes by Eq. (6);
       Obtain the prototypical representations of each instance by multiplying the global prototypes and the intermediate prototypes with its prediction;
       Update G, F and the two discriminators by optimizing Eq. (12);
end for
Output: G, F

Prototype-Level Alignment. Instance-level alignment only implicitly aligns the multi-class distribution across domains, which may not achieve semantic consistency between the two domains. Besides, in practice the global prototypes are collected from source domain instances only, so they may not accurately represent the data structure of the target domain due to the domain shift. Taking these two factors into account, we perform prototype-level alignment with the prototype-level discriminator to explicitly align the intermediate prototypical representations across domains. The corresponding loss is an adversarial loss on these intermediate prototypical representations.


Intra-Class Compactness. PAL exploits the reliable prototypes only to refine the semantic information of noisy pseudo labels, without directly improving the pseudo labels. In fact, the intermediate prototypes associate all the instances within the current training batch. By virtue of this, we take the intermediate prototypes as a proxy and further pursue intra-class compactness in the target domain to explicitly improve the pseudo labels. In this way, the pseudo labels are updated to enlarge the margin among different classes, which helps mitigate the misalignment. In particular, for target domain data, we minimize the Euclidean distance between the feature representations and the intermediate prototypical representations to encourage intra-class compactness.
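One way to write the compactness objective described above is given below. This is a sketch in assumed notation, not the paper's exact equation: $B$ is the target batch size, $f_t^i$ and $p_t^i$ are the feature and soft prediction of the $i$-th target instance, and $\hat{C}_t \in \mathbb{R}^{c \times d}$ stacks the intermediate target prototypes row by row. Whether the distance is squared and how it is averaged over the batch are assumptions.

```latex
\begin{equation}
\mathcal{L}_{intra} \;=\; \frac{1}{B}\sum_{i=1}^{B}
  \big\lVert\, f_t^{i} - \hat{C}_t^{\top} p_t^{i} \,\big\rVert_2
\end{equation}
```

Minimizing this term pulls each target feature towards the prototype mixture selected by its own prediction, which sharpens the prediction when the feature already lies near one class prototype.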


Therefore, the complete objective of our PANDA framework, Eq. (12), is a minimax optimization problem that combines the losses above.

The learning process of the PANDA framework is summarized in Algorithm 1. With only two low-dimensional discriminators added, PANDA is a simple yet generic framework for the UDA problem.

4 Experiments

4.1 Experimental Setup

We conduct experiments to verify the effectiveness and generalization ability of our methods, i.e., PANDA (the full objective in Eq. (12)) and PAL (PANDA without the intra-class objective), on two different UDA tasks: cross-domain object recognition on Office-Home [venkateswara2017Deep], ImageCLEF-DA and Office31 [saenko2010adapting], and synthetic-to-real semantic segmentation for GTA5 [richter2016playing] → Cityscapes [cordts2016cityscapes] and Synthia [ros2016synthia] → Cityscapes.

Datasets. Office-Home is a challenging dataset that consists of 65 different object categories found typically in 4 different Office and Home settings, i.e., Artistic (Ar) images, Clip Art (Cl), Product images (Pr) and Real-World (Re) images. ImageCLEF-DA is a dataset built for the ‘ImageCLEF2014:domain-adaptation’ competition. We follow [long2015learning] to select 3 subsets, i.e., C, I and P, which share 12 common object classes. Office31 is a popular dataset that includes 31 object categories taken from 3 domains, i.e., Amazon (A), DSLR (D) and Webcam (W).

Cityscapes is a realistic dataset of pixel-level annotated urban street scenes. We use its original training split and validation split as the training target data and testing target data respectively. GTA5 consists of 24,966 densely labeled synthetic road scenes annotated with the same 19 classes as Cityscapes. We take the SYNTHIA-RAND-CITYSCAPES set as the source domain for Synthia, containing 9,400 synthetic images compatible with 16 annotated classes of Cityscapes [zhang2017curriculum].

Implementation Details. For object recognition, we follow the standard protocol [ganin2015unsupervised], i.e., using all the labeled source instances and all the unlabeled target instances for UDA, and report the average accuracy over three random trials for fair comparison. Following [long2018conditional], we experiment with a ResNet-50 model pretrained on ImageNet. Specifically, we follow prior work to set the data transformation and choose the network parameters. The whole model is trained through backpropagation, where one loss weight is set to 2 and the other two increase from 0 to 1 with the same strategy as [ganin2015unsupervised]. Regarding the domain discriminator, we design a simple two-layer classifier (256→1024→1) for both the instance-level and the prototype-level discriminator. Empirically, we fix the batch size to 36 with an initial learning rate of 1e-4.
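The ramp-up from 0 to 1 referred to above follows, in [ganin2015unsupervised], a sigmoid-shaped schedule over the training progress. A minimal sketch is given below; the function name `dann_ramp` is hypothetical, and the steepness value of 10 is the one reported in [ganin2015unsupervised], assumed here to carry over unchanged.

```python
import math


def dann_ramp(progress, gamma=10.0):
    """Weight schedule 2 / (1 + exp(-gamma * progress)) - 1.

    progress: fraction of training completed, in [0, 1].
    Returns a value that starts at 0 and approaches 1.
    """
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


# Example: weights at a few points of a 10k-iteration run.
total_iterations = 10_000
schedule = [dann_ramp(i / total_iterations) for i in (0, 2_500, 5_000, 10_000)]
```

The small initial weight suppresses noisy adversarial signals early in training, when the discriminator and the pseudo labels are still unreliable.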

| Methods | Ar→Cl | Ar→Pr | Ar→Re | Cl→Ar | Cl→Pr | Cl→Re | Pr→Ar | Pr→Cl | Pr→Re | Re→Ar | Re→Cl | Re→Pr | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 [he2016deep] | 34.9 | 50.0 | 58.0 | 37.4 | 41.9 | 46.2 | 38.5 | 31.2 | 60.4 | 53.9 | 41.2 | 59.9 | 46.1 |
| DANN [ganin2015unsupervised] | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6 |
| CDAN [long2018conditional] | 49.0 | 69.3 | 74.5 | 54.4 | 66.0 | 68.4 | 55.6 | 48.3 | 75.9 | 68.4 | 55.4 | 80.5 | 63.8 |
| CDAN+E [long2018conditional] | 50.7 | 70.6 | 76.0 | 57.6 | 70.0 | 70.0 | 57.4 | 50.9 | 77.3 | 70.9 | 56.7 | 81.6 | 65.8 |
| DWT-MEC [roy2019unsupervised] | 50.3 | 72.1 | 77.0 | 59.6 | 69.3 | 70.2 | 58.3 | 48.1 | 77.3 | 69.3 | 53.6 | 82.0 | 65.6 |
| SAFN* [xu2019unsupervised] | 52.0 | 73.3 | 76.3 | 64.2 | 69.9 | 71.9 | 63.7 | 51.4 | 77.1 | 70.9 | 57.1 | 81.5 | 67.3 |
| PAL | 50.4 ± 0.5 | 68.8 ± 0.3 | 72.7 ± 0.3 | 58.0 ± 0.5 | 65.7 ± 0.3 | 68.5 ± 0.2 | 56.1 ± 0.2 | 50.4 ± 0.6 | 74.2 ± 0.1 | 67.6 ± 0.1 | 56.0 ± 0.2 | 79.6 ± 0.4 | 64.0 |
| PANDA | 52.4 ± 0.5 | 73.4 ± 0.4 | 79.0 ± 0.3 | 64.2 ± 0.1 | 74.2 ± 0.5 | 73.2 ± 0.1 | 63.0 ± 0.4 | 53.0 ± 0.4 | 79.5 ± 0.3 | 73.4 ± 0.3 | 56.7 ± 0.2 | 83.5 ± 0.2 | 68.8 |
| PAL* | 55.6 ± 0.1 | 70.9 ± 0.5 | 76.6 ± 0.1 | 62.4 ± 0.6 | 69.9 ± 0.8 | 72.4 ± 0.3 | 62.2 ± 0.5 | 54.8 ± 0.5 | 78.4 ± 0.3 | 72.2 ± 0.2 | 59.3 ± 0.3 | 82.7 ± 0.2 | 68.1 |
| PANDA* | 59.3 ± 0.0 | 75.2 ± 0.3 | 80.6 ± 0.3 | 66.7 ± 0.4 | 76.9 ± 0.8 | 77.2 ± 0.4 | 67.1 ± 0.7 | 56.4 ± 0.7 | 82.2 ± 0.3 | 73.8 ± 0.4 | 60.8 ± 0.2 | 84.2 ± 0.4 | 71.7 |

Table 1: Recognition accuracies (%) on Office-Home via ResNet-50. Red: Best, Blue: Second best. Methods marked with * in Tables 1 and 2 use the data transformation in SAFN [xu2019unsupervised].

| Methods | C→I | C→P | I→C | I→P | P→C | P→I | Avg. (ImageCLEF-DA) | A→D | A→W | D→A | D→W | W→A | W→D | Avg. (Office31) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 [he2016deep] | 78.0 | 65.5 | 91.5 | 74.8 | 91.2 | 83.9 | 80.7 | 68.9 | 68.4 | 62.5 | 96.7 | 60.7 | 99.3 | 76.1 |
| DANN [ganin2015unsupervised] | 87.0 | 74.3 | 96.2 | 75.0 | 91.5 | 86.0 | 85.0 | 79.7 | 82.0 | 68.2 | 96.9 | 67.4 | 99.1 | 82.2 |
| CDAN [long2018conditional] | 90.5 | 74.5 | 97.0 | 76.7 | 93.5 | 90.6 | 87.1 | 89.8 | 93.1 | 70.1 | 98.2 | 68.0 | 100.0 | 86.6 |
| CDAN+E [long2018conditional] | 91.3 | 74.2 | 97.7 | 77.7 | 94.3 | 90.7 | 87.7 | 92.9 | 94.1 | 71.0 | 98.6 | 69.3 | 100.0 | 87.7 |
| iCAN [zhang2018collaborative] | 89.9 | 78.5 | 94.7 | 79.5 | 92.0 | 89.7 | 87.4 | 90.1 | 92.5 | 72.1 | 98.8 | 69.9 | 100.0 | 87.2 |
| CAT [deng2019cluster] | 91.3 | 75.3 | 95.5 | 77.2 | 93.6 | 91.0 | 87.3 | 90.8 | 94.4 | 72.2 | 98.0 | 70.2 | 100.0 | 87.6 |
| SAFN* [xu2019unsupervised] | 91.1 | 77.0 | 96.2 | 78.0 | 94.7 | 91.7 | 88.1 | 87.7 | 88.8 | 69.8 | 98.4 | 69.7 | 99.8 | 85.7 |
| SAFN+Ent* [xu2019unsupervised] | 91.7 | 77.6 | 96.3 | 79.3 | 95.3 | 93.3 | 88.9 | 90.7 | 90.1 | 73.0 | 98.6 | 70.2 | 99.8 | 87.1 |
| PAL | 90.1 ± 0.3 | 75.2 ± 0.3 | 97.2 ± 0.2 | 77.4 ± 0.2 | 94.4 ± 0.2 | 91.6 ± 0.2 | 87.6 | 93.5 ± 0.4 | 94.6 ± 0.2 | 70.4 ± 0.4 | 97.4 ± 0.1 | 68.8 ± 1.4 | 100.0 ± 0.0 | 87.4 |
| PANDA | 91.7 ± 0.2 | 75.9 ± 0.6 | 97.7 ± 0.2 | 78.6 ± 0.3 | 96.9 ± 0.1 | 93.1 ± 0.2 | 89.0 | 94.2 ± 0.3 | 94.9 ± 0.2 | 73.9 ± 0.2 | 97.8 ± 0.1 | 72.8 ± 0.3 | 99.8 ± 0.0 | 88.9 |
| PAL* | 92.6 ± 0.5 | 78.8 ± 0.4 | 96.8 ± 0.2 | 77.9 ± 0.1 | 94.3 ± 0.1 | 91.3 ± 0.2 | 88.6 | 93.0 ± 0.3 | 93.2 ± 0.1 | 74.5 ± 0.1 | 97.9 ± 0.1 | 73.1 ± 0.3 | 100.0 ± 0.0 | 88.6 |
| PANDA* | 93.2 ± 0.0 | 77.6 ± 0.3 | 97.4 ± 0.1 | 80.2 ± 0.3 | 96.8 ± 0.2 | 93.3 ± 0.1 | 89.8 | 93.6 ± 0.3 | 94.7 ± 0.7 | 76.0 ± 0.1 | 98.4 ± 0.1 | 74.7 ± 0.1 | 99.5 ± 0.1 | 89.5 |

Table 2: Accuracies (%) on ImageCLEF-DA and Office31 via ResNet-50.

For semantic segmentation, we adopt DeepLab-V2 [chen2017deeplab] based on ResNet-101 [he2016deep] as done in [tsai2018learning, vu2019advent, luo2019taking, tsai2019domain]. Following DCGAN [radford2015unsupervised], the discriminator network consists of three convolutional layers with stride 2 and channel numbers {256, 512, 1}. In training, we use SGD [bottou2010large] to optimize the network with momentum (0.9), weight decay (5e-4) and initial learning rate (2.5e-4). We use the same learning rate policy as in [chen2017deeplab]. Discriminators are optimized by Adam [kingma2014adam] with an initial learning rate of 1e-4 and the same decay strategy as above. For both tasks, one loss weight is set to 1e-3 following [tsai2018learning]. For GTA5→Cityscapes, the other two loss weights are 1e-3 and 1e-4; for Synthia→Cityscapes, they are 1e-5 and 1e-1.
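The learning rate policy of [chen2017deeplab] is the polynomial ("poly") decay. A minimal sketch follows, assuming the commonly used power of 0.9 and the 100k-iteration budget stated for segmentation; the function name `poly_lr` is hypothetical.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iterations) ** power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


# Segmentation network (initial learning rate 2.5e-4) and discriminators (1e-4).
max_iterations = 100_000
seg_lrs = [poly_lr(2.5e-4, it, max_iterations) for it in (0, 50_000, 100_000)]
disc_lrs = [poly_lr(1e-4, it, max_iterations) for it in (0, 50_000, 100_000)]
```

The rate decays smoothly to zero at the final iteration, so no separate step schedule needs to be tuned.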

All experiments are implemented in PyTorch on a single Titan V GPU. The total number of iterations is set to 10k for object recognition and 100k for semantic segmentation. The momentum value is set to 0.5 for all tasks without selection. For object recognition tasks, we choose the hyper-parameters that yield the minimal mean entropy of target predictions [morerio2018minimalentropy] for Ar→Cl on Office-Home. For semantic segmentation tasks, we reserve a subset of the training target data of Cityscapes for parameter selection. Data augmentation techniques such as random scaling or ten-crop ensemble evaluation are not adopted.
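The entropy-based selection criterion mentioned above needs no target labels: each candidate configuration is scored by the mean Shannon entropy of its target predictions, and the lowest score wins. A minimal sketch follows; the function name, the configuration names and the toy predictions are hypothetical.

```python
import math


def mean_entropy(predictions):
    """Mean Shannon entropy (in nats) of a list of probability vectors."""
    total = 0.0
    for probs in predictions:
        total += -sum(q * math.log(q) for q in probs if q > 0.0)
    return total / len(predictions)


# Hypothetical target predictions produced under two hyper-parameter settings.
candidates = {
    "config_a": [[0.9, 0.1], [0.8, 0.2]],
    "config_b": [[0.6, 0.4], [0.5, 0.5]],
}

# Select the setting whose target predictions are the most confident on average.
best = min(candidates, key=lambda name: mean_entropy(candidates[name]))
```

Low mean entropy only measures confidence, so this criterion is a label-free proxy for accuracy rather than a guarantee of it.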

4.2 Comparison Results

To be fair, we compare our methods (i.e., PAL and PANDA) with other methods implemented under experimental settings similar to ours for both object recognition and semantic segmentation tasks.

Cross-Domain Object Recognition. The comparison results between our methods (i.e., PAL and PANDA) and previous state-of-the-art (SOTA) methods [xu2019unsupervised, long2018conditional, zhang2018collaborative] on Office-Home, ImageCLEF-DA and Office31 are shown in Tables 1 and 2. As indicated in these tables, PANDA improves over previous approaches in average accuracy on all three benchmarks (i.e., 67.3% → 68.8% for Office-Home, 88.9% → 89.0% for ImageCLEF-DA, and 87.7% → 88.9% for Office31). Generally, PANDA performs the best for most transfer tasks, i.e., 10 out of 12 tasks on Office-Home, 3 out of 6 tasks on ImageCLEF-DA and 4 out of 6 tasks on Office31. Taking a careful look at PAL, we find that it beats CDAN in average accuracy on all three benchmarks and achieves competitive performance with SOTA methods like CAT [deng2019cluster], which suggests that our prototypical representation is more reliable than instance-level pseudo labels for adversarial domain alignment.

Synthetic-to-real Semantic Segmentation. We compare our PAL and PANDA with SOTA approaches [tsai2018learning, vu2019advent, luo2019taking, tsai2019domain] on synthetic-to-real semantic segmentation. Following [chen2017no], for GTA5→Cityscapes, we evaluate models on all 19 classes, while for Synthia→Cityscapes, only 13 classes are evaluated, excluding wall, fence and pole. As shown in Table 3, our PAL method outperforms almost all of these methods. Our PANDA framework further achieves the best results among these competitive methods on both segmentation tasks, i.e., 45.5% → 46.5% for GTA5→Cityscapes and 48.0% → 48.9% for Synthia→Cityscapes in terms of the mean IoU (mIoU) value, which shows that PANDA is generic. It is worth noting that we directly extend PANDA to semantic segmentation by treating each pixel as an ‘instance’ in the image classification sense and considering all pixels equally for the target intra-class compactness objective.
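The mIoU metric used above averages the per-class intersection-over-union. A minimal pure-Python sketch on flattened label lists follows; the function name is hypothetical, the ignore label of 255 follows the common Cityscapes convention and is an assumption here, and classes that never occur are skipped rather than counted as zero.

```python
def mean_iou(pred_labels, true_labels, num_classes, ignore_label=255):
    """Mean intersection-over-union over classes that occur at least once."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, g in zip(pred_labels, true_labels):
        if g == ignore_label:
            continue  # unlabeled pixels do not contribute
        if p == g:
            intersection[g] += 1
            union[g] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[g] += 1  # false negative for the true class
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes and four pixels.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Because every class contributes equally to the mean, rare classes such as train or bike weigh as much as road, which is why per-class columns are reported alongside mIoU.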

| Source | Methods | road | sdwk | bldng | wall | fence | pole | light | sign | veg. | ter. | sky | per. | rider | car | truck | bus | train | mbike | bike | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GTA5 | NonAdapt [tsai2018learning] | 75.8 | 16.8 | 77.2 | 12.5 | 21.0 | 25.5 | 30.1 | 20.1 | 81.3 | 24.6 | 70.3 | 53.8 | 26.4 | 49.9 | 17.2 | 25.9 | 6.5 | 25.3 | 36.0 | 36.6 |
| GTA5 | AdaptSeg(single) [tsai2018learning] | 86.5 | 25.9 | 79.8 | 22.1 | 20.0 | 23.6 | 33.1 | 21.8 | 81.8 | 25.9 | 75.9 | 57.3 | 26.2 | 76.3 | 29.8 | 32.1 | 7.2 | 29.5 | 32.5 | 41.4 |
| GTA5 | AdaptSeg(multi) [tsai2018learning] | 86.5 | 36.0 | 79.9 | 23.4 | 23.3 | 23.9 | 35.2 | 14.8 | 83.4 | 33.3 | 75.6 | 58.5 | 27.6 | 73.7 | 32.5 | 35.4 | 3.9 | 30.1 | 28.1 | 42.4 |
| GTA5 | AdvEnt [vu2019advent] | 89.9 | 36.5 | 81.6 | 29.2 | 25.2 | 28.5 | 32.3 | 22.4 | 83.9 | 34.0 | 77.1 | 57.4 | 27.9 | 83.7 | 29.4 | 39.1 | 1.5 | 28.4 | 23.3 | 43.8 |
| GTA5 | AdvEnt+MinEnt [vu2019advent] | 89.4 | 33.1 | 81.0 | 26.6 | 26.8 | 27.2 | 33.5 | 24.7 | 83.9 | 36.7 | 78.8 | 58.7 | 30.5 | 84.8 | 38.5 | 44.5 | 1.7 | 31.6 | 32.4 | 45.5 |
| GTA5 | CLAN [luo2019taking] | 87.0 | 27.1 | 79.6 | 27.3 | 23.3 | 28.3 | 35.5 | 24.2 | 83.6 | 27.4 | 74.2 | 58.6 | 28.0 | 76.2 | 33.1 | 36.7 | 6.7 | 31.9 | 31.4 | 43.2 |
| GTA5 | AdaptPatch [tsai2019domain] | 89.2 | 38.4 | 80.4 | 24.4 | 21.0 | 27.7 | 32.9 | 16.1 | 83.1 | 34.1 | 77.8 | 57.4 | 27.6 | 78.6 | 31.2 | 40.2 | 4.7 | 27.6 | 27.6 | 43.2 |
| GTA5 | PAL | 89.7 | 29.7 | 82.2 | 29.4 | 25.7 | 29.1 | 34.9 | 21.6 | 83.4 | 35.2 | 78.0 | 60.0 | 29.0 | 84.3 | 33.4 | 45.4 | 8.0 | 26.3 | 28.0 | 44.9 |
| GTA5 | PANDA | 92.4 | 51.3 | 82.9 | 31.8 | 24.9 | 32.6 | 35.8 | 20.4 | 84.5 | 38.7 | 79.8 | 60.0 | 25.8 | 85.1 | 33.7 | 44.1 | 9.0 | 27.5 | 22.6 | 46.5 |
| Synthia | NonAdapt [tsai2018learning] | 55.6 | 23.8 | 74.6 | - | - | - | 6.1 | 12.1 | 74.8 | - | 79.0 | 55.3 | 19.1 | 39.6 | - | 23.3 | - | 13.7 | 25.0 | 38.6 |
| Synthia | AdaptSeg(single) [tsai2018learning] | 79.2 | 37.2 | 78.8 | - | - | - | 9.9 | 10.5 | 78.2 | - | 80.5 | 53.5 | 19.6 | 67.0 | - | 29.5 | - | 21.6 | 31.3 | 45.9 |
| Synthia | AdaptSeg(multi) [tsai2018learning] | 84.3 | 42.7 | 77.5 | - | - | - | 4.7 | 7.0 | 77.9 | - | 82.5 | 54.3 | 21.0 | 72.3 | - | 32.2 | - | 18.9 | 32.3 | 46.7 |
| Synthia | AdvEnt [vu2019advent] | 87.0 | 44.1 | 79.7 | - | - | - | 4.8 | 7.2 | 80.1 | - | 83.6 | 56.4 | 23.7 | 72.7 | - | 32.6 | - | 12.8 | 33.7 | 47.6 |
| Synthia | AdvEnt+MinEnt [vu2019advent] | 85.6 | 42.2 | 79.7 | - | - | - | 5.4 | 8.1 | 80.4 | - | 84.1 | 57.9 | 23.8 | 73.3 | - | 36.4 | - | 14.2 | 33.0 | 48.0 |
| Synthia | CLAN [luo2019taking] | 81.3 | 37.0 | 80.1 | - | - | - | 16.1 | 13.7 | 78.2 | - | 81.5 | 53.4 | 21.2 | 73.0 | - | 32.9 | - | 22.6 | 30.7 | 47.8 |
| Synthia | AdaptPatch [tsai2019domain] | 82.2 | 39.4 | 79.4 | - | - | - | 6.5 | 10.8 | 77.8 | - | 82.0 | 54.9 | 21.1 | 67.7 | - | 30.7 | - | 17.8 | 32.2 | 46.3 |
| Synthia | PAL | 86.9 | 42.8 | 79.6 | - | - | - | 7.7 | 9.2 | 79.0 | - | 82.2 | 55.9 | 20.9 | 81.2 | - | 35.2 | - | 17.2 | 30.7 | 48.3 |
| Synthia | PANDA | 88.1 | 44.2 | 81.1 | - | - | - | 10.0 | 11.1 | 80.3 | - | 84.3 | 42.8 | 21.6 | 82.5 | - | 34.6 | - | 16.9 | 38.7 | 48.9 |

Table 3: Comparison of adversarial-learning-based methods on synthetic-to-real semantic segmentation, all using the same architecture, i.e., the DeepLab-V2 framework with ResNet-101.
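
The mIoU column of Table 3 appears to be the plain mean of the per-class entries in each row: 19 classes for GTA5, and the 13 reported classes for Synthia, where entries marked '-' are skipped. A minimal sketch checking this on the two PANDA rows follows; the helper name `mean_iou` is ours for illustration and does not come from the paper.

```python
def mean_iou(per_class):
    """Average the per-class IoU values, skipping classes marked '-' (None)."""
    reported = [v for v in per_class if v is not None]
    return sum(reported) / len(reported)

# PANDA rows copied from Table 3 (per-class IoU in %, None = not reported).
panda_gta5 = [92.4, 51.3, 82.9, 31.8, 24.9, 32.6, 35.8, 20.4, 84.5, 38.7,
              79.8, 60.0, 25.8, 85.1, 33.7, 44.1, 9.0, 27.5, 22.6]
panda_synthia = [88.1, 44.2, 81.1, None, None, None, 10.0, 11.1, 80.3, None,
                 84.3, 42.8, 21.6, 82.5, None, 34.6, None, 16.9, 38.7]

print(round(mean_iou(panda_gta5), 1))     # 46.5
print(round(mean_iou(panda_synthia), 1))  # 48.9
```

Both values agree with the mIoU entries reported for PANDA, so the two settings are averaged over different numbers of classes and are not directly comparable with each other.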

4.3 Further Analysis

Ablation Study. To analyze whether each component in Eq. (12) is effective, we introduce a variant named PAL that simply omits the intra-class objective. The empirical convergence curves for Ar→Cl on Office-Home in Fig. 3(a) suggest that all of our variants tend to converge after 10k iterations, and that the second term, i.e., the prototype-level conditional alignment, helps accelerate convergence. Fig. 3(b) shows that all components of the PANDA framework, i.e., PAL alignment at different levels and the intra-class objective, bring evident improvements on both classification and segmentation tasks.

Domain Discrepancy. As shown in Fig. 3(c), we report the proxy $\mathcal{A}$-distances [ganin2016domain] of different methods for Ar→Cl on Office-Home and C→I on ImageCLEF-DA. The $\mathcal{A}$-distance $d_{\mathcal{A}} = 2(1 - 2\epsilon)$ is a popular measure of domain discrepancy, where $\epsilon$ is the test error of a binary classifier trained on the learned features to discriminate source from target. With adversarial domain alignment, all UDA methods yield smaller distances than the model before adaptation, i.e., ‘source only’. Moreover, PANDA attains the minimum distance on both tasks, implying that it learns features that better bridge the shift between the domains.
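
To make the measure concrete, the following minimal sketch computes a proxy A-distance as 2(1 - 2 * error) from the held-out error of a source-versus-target classifier. The nearest-centroid classifier, the toy Gaussian features, and all function names are illustrative assumptions; the classifier actually used for Fig. 3(c) is not specified here.

```python
import random

def proxy_a_distance(test_error):
    """Proxy A-distance from a domain classifier's test error: 2 * (1 - 2 * err)."""
    return 2.0 * (1.0 - 2.0 * test_error)

def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def domain_classifier_error(src_train, tgt_train, src_test, tgt_test):
    """Held-out error of a nearest-centroid source-vs-target classifier
    (an assumed stand-in for whatever binary classifier is preferred)."""
    c_src, c_tgt = centroid(src_train), centroid(tgt_train)
    wrong = sum(sq_dist(f, c_src) > sq_dist(f, c_tgt) for f in src_test)
    wrong += sum(sq_dist(f, c_tgt) >= sq_dist(f, c_src) for f in tgt_test)
    return wrong / (len(src_test) + len(tgt_test))

def toy_features(n, shift, rng):
    """Toy 2-D 'features': unit-variance Gaussian centred at (shift, shift)."""
    return [[rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)] for _ in range(n)]

rng = random.Random(0)
for shift in (0.0, 3.0):  # 0.0: domains overlap; 3.0: domains are far apart
    src = toy_features(400, 0.0, rng)
    tgt = toy_features(400, shift, rng)
    err = domain_classifier_error(src[:200], tgt[:200], src[200:], tgt[200:])
    print(shift, round(proxy_a_distance(err), 2))
```

A classifier at chance level (error 0.5) gives a distance of 0, while a perfect domain classifier (error 0) gives the maximum of 2, so better-aligned features should produce smaller values.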

Figure 3: Quantitative analysis of our methods. Panels: (a) convergence, (b) ablation, (c) proxy $\mathcal{A}$-distance, (d) base network.
Figure 4: t-SNE [maaten2008visualizing] embedding visualizations of UDA methods for the C→I task on ImageCLEF-DA. Panels: (a) DANN [ganin2015unsupervised], (b) CDAN [long2018conditional], (c) PAL, (d) PANDA. In the upper row, class information is denoted by different shapes, with distinct markers for source and target; in the bottom row, colors denote domains (red: source, blue: target).



Figure 5: Qualitative results of different methods on synthetic-to-real semantic segmentation for GTA5→Cityscapes. Row tags are marked on the left, where ‘GT’ is the ground truth and ‘Src only’ is the result before adaptation.

Intra-class weight 0.01 0.02 0.05 0.1 0.2 0.5 1 2 5
Acc. (%) 51.8 51.7 51.8 52.0 52.3 52.5 52.0 52.4 30.5
Momentum 0.1 0.3 0.5 0.7 0.9
Acc. (%) 52.2 51.9 52.4 52.0 50.9
Table 4: Sensitivity to the weight of the intra-class objective and to the prototype momentum for Ar→Cl on Office-Home.

Network Sensitivity. To test the sensitivity of PANDA to the network architecture, in Fig. 3(d) we report the accuracies of DANN, CDAN and PANDA for C→I with three different backbone architectures, i.e., VGG-16, ResNet-18 and ResNet-50. PANDA is the best-performing method with every backbone, showing desirable robustness to changes in the network.

Parameter Analysis. PANDA is based on adversarial domain alignment. Therefore, for the balancing weights of the two adversarial loss terms in Eq. (12), we follow the default settings of previous works [long2018conditional, tsai2018learning] without tuning them, except for the semantic segmentation task Synthia→Cityscapes, where we set a smaller weight for the prototype-level alignment term. The reason is that some target classes, such as ‘terrain’, ‘truck’ and ‘train’, are not observed in the source domain, so strongly aligning these prototypes across domains may cause misalignment.

Here we test the sensitivity of PANDA to the two newly added hyper-parameters, i.e., the weight of the intra-class objective and the momentum used to update the global prototypes, through a case study on Ar→Cl. The classification accuracies (%) on Ar→Cl under different values of both are shown in Table 4. The prototype-based intra-class objective brings stable improvement over a wide range of weights, i.e., from 0.1 to 2. For the momentum, the results show that a large value such as 0.9 hurts performance: because global prototypes are summarized only over source batch features, and features become more domain-invariant as adaptation proceeds, heavy dependence on historical features makes the global prototypes less domain-invariant.
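
The role of the momentum can be illustrated with a short sketch. It assumes the global prototypes follow an exponential-moving-average update, c_k ← m · c_k + (1 − m) · (mean source feature of class k in the batch); this form, the function name `update_prototypes`, and the toy values are assumptions for illustration, and the exact update is the one defined in the method section.

```python
def update_prototypes(prototypes, feats, labels, momentum):
    """Blend each class prototype with the mean source feature of that class.

    Assumed form: c_k <- momentum * c_k + (1 - momentum) * batch_mean_k.
    Classes absent from the batch keep their previous prototype.
    """
    updated = [list(p) for p in prototypes]
    for k in range(len(prototypes)):
        members = [f for f, y in zip(feats, labels) if y == k]
        if not members:
            continue
        batch_mean = [sum(col) / len(members) for col in zip(*members)]
        updated[k] = [momentum * c + (1.0 - momentum) * m
                      for c, m in zip(prototypes[k], batch_mean)]
    return updated

# Two classes, 2-D features; class 1 has no sample in this source batch.
protos = [[0.0, 0.0], [1.0, 1.0]]
feats = [[2.0, 4.0], [4.0, 0.0]]
labels = [0, 0]
print(update_prototypes(protos, feats, labels, momentum=0.5))  # [[1.5, 1.0], [1.0, 1.0]]
```

Under this assumed update, a momentum close to 1 keeps the prototypes near their history, so they lag behind features that are still becoming more domain-invariant, which is consistent with the accuracy drop at 0.9 in Table 4.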

Visualizations. For object recognition, we study the t-SNE visualizations of the aligned features generated by different UDA methods in Fig. 4. As expected, conditional adversarial learning methods, including CDAN and PAL, semantically align the multi-class distributions much better than DANN. Moreover, PAL learns slightly better features than CDAN because prototypical representations are more reliable. When the prototype-based intra-class objective is added, PANDA further enhances PAL by adaptively improving the pseudo labels and achieves the best adaptation performance. For semantic segmentation, we present some qualitative results in Fig. 5. Similarly, PAL effectively improves the adaptation performance, and PANDA further improves the target segmentation results.

5 Conclusion

In this work, we develop a novel and generic PAL scheme for solving the UDA problem. Unlike previous adversarial domain adaptation methods that rely solely on instance predictions, PAL first exploits global prototypes to refine the semantic information of noisy pseudo labels and then leverages the resulting reliable prototypical representations to achieve conditional adversarial domain alignment. We further complement this scheme by imposing intra-class compactness, with the intermediate prototypes as a proxy, to adaptively improve the pseudo labels, thus obtaining the integrated framework PANDA. Extensive evaluations on both object recognition and semantic segmentation tasks demonstrate the effectiveness of PANDA and its advantage over well-established UDA baselines.