Given large amounts of annotated training data, current fully-supervised semantic segmentation algorithms deliver outstanding results. Because annotating many images at the pixel-level is expensive, a common practice is to generate synthetic data and to rely on unsupervised domain adaptation to bridge the gap from the synthetic source to the real-world target domain.
In this paper, we introduce a novel approach to aligning cross-domain features for both unsupervised and semi-supervised domain adaptation. In the first case, no target domain labels are given, whereas in the second one only a small amount of annotated data is available in the target domain. In practice, this second scenario is important because even a handful of target domain labels can boost the performance significantly.
Our key insight is that patches from both domains that are structurally similar in label space should also have similar distributions in feature space. To enforce this, which goes beyond what can be done using ordinary pixel-wise similarity, we introduce a patch-wise metric. It measures label disparity at several levels of resolution and is paired with a contrastive loss that, when minimized, pulls the feature distributions of patches with similar label-space structures closer together and pushes them apart otherwise; see Fig. 1.
To perform unsupervised or semi-supervised training, we incorporate unlabeled pixels into training by using pseudo labels that we iteratively generate using the output of the partially trained network. At each training iteration, we use our patch-wise metric to decrease the feature space disparity of patches that are structurally similar and to increase it for those that are not. This renders a more straightforward approach than adversarial learning, previously often used for cross-domain feature alignment. We do not require an extra discriminator network and therefore eliminate the sometimes substantial difficulty of having to train it.
Our experiments show that our approach delivers a performance increase over state-of-the-art methods in the unsupervised regime [43, 49] and an even larger boost in the semi-supervised one [45, 20]. In a practical setting, we believe the latter to be particularly significant because, while it is very tedious to obtain huge amounts of annotated frames, it is almost always possible to supply a few annotated ones. Our contribution is therefore a new contrastive learning approach to aligning features across different domains for semantic segmentation. It relies on structural label disparity instead of adversarial learning and outperforms state-of-the-art methods for unsupervised domain adaptation, and even more so for semi-supervised domain adaptation. We also show how our approach can be extended to weakly-supervised domain adaptation, where accepting a minor drop in accuracy can substantially reduce the annotation cost.
2 Related Work
Most recent works on domain adaptation focus on unsupervised methods, with only very few incorporating the limited amount of supervision we advocate. We briefly review these two classes of approaches and then discuss contrastive learning, which is a central component of our approach.
Unsupervised Domain Adaptation (UDA) for Semantic Segmentation
aims to align the source and target domain feature distributions given annotated data only in the source domain. A popular approach is to leverage adversarial learning to generate domain-invariant features. This trend started with  and was extended to different levels of representation, including feature space [6, 16, 32, 38] and label space [7, 17, 30, 42, 43, 44]. Notable extensions are [4, 7, 32] which enforce class-wise alignment to narrow the distributions to be matched. However, all these adversarial learning approaches rely on extra discriminator networks which are complicated and hard to train jointly with the generator network.
Other widely used UDA approaches to semantic segmentation include generating realistic-looking synthetic images [18, 39, 46, 48, 49, 51], using pseudo labels for self-training [24, 26, 40, 49, 53], and leveraging weak labels [28, 32]. Among these methods, FDA  is a simple approach which achieves state-of-the-art performance by generating realistic-looking synthetic images directly in Fourier space and self-training.
Our approach builds on FDA and incorporates a novel domain-wise contrastive loss, without the reliance on complex adversarial learning.
Semi-Supervised Domain Adaptation for Semantic Segmentation (SSDA)
assumes that a handful of target domain labels are available. In the context of semantic segmentation it has not received as much attention as UDA. For example, 
achieves adaptation by alternately maximizing the conditional entropy of unlabeled target data with respect to the classifier and minimizing it with respect to the feature encoder. In , a two-stream architecture is proposed, where one stream operates in the source and the other in the target domain. In contrast to other methods, the weights in corresponding layers are related but not shared, and they are optimized to deliver good performance in both domains. In , target domain samples are perturbed to reduce intra-domain discrepancy using adversarial learning.
However, none of these works has been demonstrated for semantic segmentation, nor do they leverage contrastive learning for SSDA. The only ones that do are  and . In , class-wise adversarial learning is used to promote the similarity of pixel-level feature representations for the same classes. In , a pixel-level entropy regularization scheme is introduced to favor feature alignment among multiple domains. In both cases, domain alignment is only enforced at the pixel level, whereas ours is done at a more semantic level.
Contrastive Learning (CL)
aims to learn visual representations by leveraging both similar and dissimilar samples. Early work  showcased improved visual representation by contrasting positive pairs against negative ones. Deep CL has been used extensively [10, 47, 52, 41, 13, 29, 5, 25, 9, 50, 19, 21]
for applications such as image classification, image-to-image translation or phrase grounding .
In the context of semantic segmentation, CL has also been used for intra-domain model pre-training . The approach is specifically designed for use in conjunction with a U-Net  backbone and does not generalize well to other architectures. In our experiments, we show that using CL directly as a loss term achieves superior results compared to using CL only for model pre-training.
3 Approach

We first formalize the problem of domain adaptation in Sec. 3.1. We then give an overview of our proposed model in Sec. 3.2 and explain how contrastive learning is used for domain adaptation in Sec. 3.3 and 3.4.
3.1 Problem Formulation
Let $\mathcal{S} = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ be a source-domain dataset, where $x_i^s$ denotes a color image and $y_i^s$ the corresponding semantic map. The target-domain dataset is split into two sets. The first is a labeled set $\mathcal{T}_l = \{(x_i^t, y_i^t)\}_{i=1}^{N_l}$ with ground-truth semantic maps. The second is the unlabeled set $\mathcal{T}_u = \{x_i^u\}_{i=1}^{N_u}$ without ground-truth semantic labels. We have $N_l = 0$ in the Unsupervised Domain Adaptation (UDA) setting and $N_l > 0$ for Semi-Supervised Domain Adaptation (SSDA). In most real-world scenarios, we have $N_l \ll N_s$ and $N_l \ll N_u$. Given the three data sets $\mathcal{S}$, $\mathcal{T}_l$ and $\mathcal{T}_u$, the task is to learn a single model that performs well on previously unseen target domain data.
3.2 Overview of Our Approach
Our model consists of an encoder $f$ that maps an input image $x$
to a list of feature vectors $\{F_k\}$, and two decoders $h$ and $g$ that map these intermediate features to a list of latent vectors $\{z_k\}$ and a semantic map $\hat{y}$. Let $G = g \circ f$ and $H = h \circ f$ be the networks that take as input an image and return a semantic map and a latent vector, respectively. Fig. 2
gives an overview of our network architecture and the loss functions, which we describe below.
In practice, the functions $G$ and $H$
are convolutional neural networks, but for the sake of describing our contrastive loss in Sec. 3.3, a patch-wise representation is useful. Therefore, each image is decoded into $K$ latent vectors, corresponding to rectangular patches $P_k$ for $1 \le k \le K$, and the encoder preserves the association between features and patches so that we can write

$F_k = f(P_k), \qquad z_k = h(F_k), \qquad (1)$

where $F_k$ and $z_k$ are the local features and latent vectors associated to patch $P_k$. In the remainder of the paper, we will denote by $P^s$, $P^t$ and $P^u$ patches of the source, labeled target and unlabeled target images.
Loss functions for classification:
Training all parameters of our model involves minimizing several loss functions, including the novel contrastive loss that we describe in Sec. 3.3. For the per-pixel classification task, we also use supervised cross-entropy losses for the source domain images and, if available, labeled target domain images, respectively, which are defined as

$\mathcal{L}_s = -\sum_{i=1}^{N_s} \sum_{p} \langle\, y_i^s(p),\, \log G(x_i^s)(p) \,\rangle, \qquad (2)$

$\mathcal{L}_t = -\sum_{i=1}^{N_l} \sum_{p} \langle\, y_i^t(p),\, \log G(x_i^t)(p) \,\rangle, \qquad (3)$

where $p$ ranges over pixels and $y(p)$ denotes the one-hot label at pixel $p$.
Additionally, we employ a regularization loss $\mathcal{L}_{reg}$.
We summarize these three losses into our base loss

$\mathcal{L}_{base} = \mathcal{L}_s + \mathcal{L}_t + \lambda_{reg}\, \mathcal{L}_{reg}, \qquad (5)$

with $\lambda_{reg}$ being a small scaling factor.
After some initial training steps, we can use $G$ to assign pseudo labels to unlabeled images and to compute an additional cross-entropy loss term

$\mathcal{L}_{pl} = -\sum_{i=1}^{N_u} \sum_{p} \langle\, \hat{y}_i^u(p),\, \log G(x_i^u)(p) \,\rangle, \qquad (6)$

where $\hat{y}_i^u$ are the pseudo labels. In Sec. 4.5, we demonstrate that pseudo labeling becomes more effective in the SSDA setting than in an unsupervised one, because the pseudo labels are more reliable.
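For concreteness, pseudo labels can be obtained by running the partially trained segmentation network on unlabeled images and keeping the arg-max class per pixel. The confidence threshold used below to discard unreliable pixels is our own illustrative assumption, not a detail specified in the paper:

```python
import numpy as np

def pseudo_labels(logits, confidence_threshold=0.9, ignore_index=255):
    """Derive per-pixel pseudo labels from raw network outputs.

    Pixels whose top-class probability is below `confidence_threshold`
    are set to `ignore_index` so that they do not contribute to the
    cross-entropy loss. The thresholding scheme is an illustrative choice.
    """
    # logits: array of shape (C, H, W) for a single image.
    e = np.exp(logits - logits.max(axis=0, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=0, keepdims=True)
    labels = probs.argmax(axis=0)
    labels[probs.max(axis=0) < confidence_threshold] = ignore_index
    return labels
```

In practice, such labels would be regenerated as the network improves, matching the iterative scheme described above.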
3.3 Contrastive Learning for Domain Adaptation
The main contribution of this paper is to leverage contrastive learning for domain adaptation. Instead of relying on an adversarial training scheme, as most prior works do [45, 43], we use a contrastive loss on pairs of patches from different domains. The goal is to bring the representations of positive pairs closer together while pushing those of negative pairs apart. A benefit of our approach is that optimizing this loss is simpler than adversarial learning, which involves a min-max optimization scheme. The main challenge, however, is to define positive and negative pairs of patches across domains for the contrastive loss.
3.3.1 Matching of Patches for Contrastive Learning
The key idea of our approach is that if two patches (one from the source domain and the other from the target domain) are semantically similar, then their embedding in latent space should also be similar. Conversely, if two patches are semantically dissimilar, their embeddings should also have a large distance.
To find such pairs, let us assume a semantic disparity function $d(\cdot, \cdot)$, which we define formally in the following section. Patches with high semantic similarity have low disparity values, and vice-versa. We sample a patch in an image of one domain and compute the disparity to all patches in an image from the other domain. Pairs of patches with a low disparity score, $d \le \tau^{+}$, are considered positive, and pairs with a high disparity, $d \ge \tau^{-}$, negative. Pairs with disparity values in-between $\tau^{+}$ and $\tau^{-}$ are simply ignored. Fig. 3 gives an example of positive and negative pairs.
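The pair selection can be sketched as follows, with `pos_thresh` and `neg_thresh` standing in for the two disparity thresholds (their values here are placeholders):

```python
import numpy as np

def mine_pairs(disparity, pos_thresh=0.1, neg_thresh=1.0):
    """Split candidate cross-domain patch pairs by label-space disparity.

    `disparity[i, j]` is the disparity between patch i of one domain and
    patch j of the other. Pairs below `pos_thresh` become positives, pairs
    above `neg_thresh` become negatives, and everything in-between is
    ignored, as described in the text.
    """
    positives = np.argwhere(disparity <= pos_thresh)
    negatives = np.argwhere(disparity >= neg_thresh)
    return positives, negatives
```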
To define a contrastive loss with the discovered pairs of patches, let us denote by $P^q$ the query patch sampled from one domain, and by $P^{+}$ and $\{P^{-}_n\}_{n=1}^{N}$ the positive patch and negative patches sampled from the other domain. Let $z^q$, $z^{+}$ and $z^{-}_n$ be the corresponding latent vectors. We can then define the contrastive loss as

$\mathcal{L}_{con} = -\log \frac{\mathrm{sim}(z^q, z^{+})}{\mathrm{sim}(z^q, z^{+}) + \sum_{n=1}^{N} \mathrm{sim}(z^q, z^{-}_n)}, \qquad (7)$

with $\mathrm{sim}(u, v)$ being the similarity between any two vectors, defined as the exponential of the cosine similarity normalized by a temperature parameter $\tau$.
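A minimal sketch of this loss for one query patch with a single positive and several negatives, using the exponential of the temperature-scaled cosine similarity; the temperature value below is a placeholder:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def contrastive_loss(z_q, z_pos, z_negs, tau=0.07):
    """InfoNCE-style contrastive loss over latent vectors.

    sim(u, v) = exp(cos(u, v) / tau); the loss is low when the query is
    close to the positive and far from all negatives, and large otherwise.
    """
    s_pos = np.exp(cosine(z_q, z_pos) / tau)
    s_neg = sum(np.exp(cosine(z_q, z_n) / tau) for z_n in z_negs)
    return -np.log(s_pos / (s_pos + s_neg))
```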
3.3.2 Patch-Wise Semantic Disparity
We need a measure of patch-wise semantic disparity in label space to define positive and negative pairs as described in Sec. 3.3.1. Given a pair of patches, one from source and the other from the target domain, we define a metric accounting for both semantic and structural disparity in label space.
Let us consider a patch $P$ with semantic map $y$ from either domain. Let $v_c$ be the proportion of pixels in $P$ that have label $c$, for $1 \le c \le C$. We take

$v(P) = [v_1, \ldots, v_C] \qquad (8)$

to be the semantic vector of $P$, containing the overall semantic information without any spatial layout information.
To also encode rough spatial information and allow for robust matching, we adopt a simplified version of spatial pyramid matching  in label space. We compute the semantic vector via Eq. (8) on three spatial levels, as shown in Fig. 4. We define the patch-wise semantic disparity between $P^s$ and $P^t$, from source and target domains, respectively, as

$d(P^s, P^t) = \sum_{l=1}^{3} \frac{1}{4^{\,l-1}} \sum_{j=1}^{4^{\,l-1}} \left\| v(P^s_{l,j}) - v(P^t_{l,j}) \right\|, \qquad (9)$

with $\|\cdot\|$ a distance between semantic vectors. That is, we measure the distance between semantic vectors at 3 pyramid levels $l$, each with $4^{l-1}$ sub-patches, where $P^s_{l,j}$ and $P^t_{l,j}$ denote the $j$-th sub-patch of $P^s$ and $P^t$ at level $l$. The first spatial level covers the whole patch, hence $4^{l-1}=1$. At the second and third levels, we split the patch into 4 and 16 sub-patches, respectively. The coefficient $1/4^{\,l-1}$ in Eq. (9) ensures an equal contribution from all levels.
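The disparity computation can be sketched as follows; the L1 distance between semantic vectors and patch sides divisible by four are illustrative assumptions:

```python
import numpy as np

def semantic_vector(labels, num_classes):
    """Per-class pixel proportions of a label patch (cf. Eq. (8))."""
    counts = np.bincount(labels.ravel(), minlength=num_classes)
    return counts / labels.size

def pyramid_disparity(patch_a, patch_b, num_classes):
    """Label-space disparity accumulated over a 3-level spatial pyramid.

    Level l compares a (2**(l-1) x 2**(l-1)) grid of sub-patches, i.e.
    1, 4, and 16 sub-patches; dividing by the number of sub-patches gives
    every level an equal total weight.
    """
    h, w = patch_a.shape
    total = 0.0
    for level in range(3):                # levels with 1, 4, 16 sub-patches
        g = 2 ** level
        weight = 1.0 / (g * g)
        sh, sw = h // g, w // g
        for i in range(g):
            for j in range(g):
                sa = patch_a[i * sh:(i + 1) * sh, j * sw:(j + 1) * sw]
                sb = patch_b[i * sh:(i + 1) * sh, j * sw:(j + 1) * sw]
                total += weight * np.abs(
                    semantic_vector(sa, num_classes)
                    - semantic_vector(sb, num_classes)).sum()
    return total
```

Identical label patches yield a disparity of zero, while patches with disjoint class content maximize it at every level.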
3.4 Training Strategy
Our full training objective combines the losses defined above into

$\mathcal{L} = \mathcal{L}_{base} + \mathcal{L}_{pl} + \lambda_{gt}\, \mathcal{L}_{con}^{gt} + \lambda_{pl}\, \mathcal{L}_{con}^{pl}, \qquad (10)$

which we minimize with respect to the network parameters. We introduce weighting factors $\lambda_{gt}$ and $\lambda_{pl}$ to balance the impact of the individual loss terms.
We first train a network only with $\mathcal{L}_{base}$, which we use to estimate pseudo labels. Then, our network is re-initialized and re-trained with the full loss. The contrastive loss operates on both labeled target and pseudo-labeled target data, and we use a lower weight for the term operating on pseudo-labeled data. Note again that our training approach does not require adversarial learning objectives for domain adaptation.
3.5 Implementation Details
The decoder that extracts the latent variables for our contrastive loss is more sophisticated. It uses an average pooling layer followed by a two-layer perceptron to project the feature patches of the encoder (see Eq. (1)) into the latent space in which the contrastive loss of Eq. (7) is computed. We add such decoders to multiple intermediate layers, as suggested by previous work . Note that this decoder is only required at training time.
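A sketch of such a latent decoder with illustrative layer sizes; unit-normalizing the output is our own assumption, a natural choice when the contrastive loss compares latent vectors with cosine similarity:

```python
import numpy as np

def latent_decoder(features, w1, b1, w2, b2):
    """Project an encoder feature patch to a latent vector.

    Average pooling over the spatial dimensions is followed by a
    two-layer perceptron with a ReLU in between; the output is
    unit-normalized.
    """
    pooled = features.mean(axis=(1, 2))          # (C, H, W) -> (C,)
    hidden = np.maximum(0.0, w1 @ pooled + b1)   # ReLU
    z = w2 @ hidden + b2
    return z / np.linalg.norm(z)
```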
To improve our overall adaptation quality, we employ the recently proposed Fourier Domain Adaptation (FDA) method . It translates source domain images to the target domain by swapping the low-frequency component of the spectrum of the source image with that of a randomly selected target one. The strength of this approach is that the translation is very simple and requires no deep neural network. Thus, the source domain images we use in Eq. (2) are translated using FDA.
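The spectral swap can be sketched as follows; `beta`, which controls the size of the swapped low-frequency square, is a placeholder value:

```python
import numpy as np

def fda_transfer(src, tgt, beta=0.01):
    """Translate a source image toward the target style, FDA-style.

    The low-frequency amplitudes of the source spectrum are replaced by
    those of the target while the source phase is kept, so no network is
    involved in the translation.
    """
    fft_src = np.fft.fft2(src, axes=(0, 1))
    fft_tgt = np.fft.fft2(tgt, axes=(0, 1))
    amp_src, phase_src = np.abs(fft_src), np.angle(fft_src)
    amp_tgt = np.abs(fft_tgt)
    # Shift so the low frequencies form a square at the spectrum center.
    amp_src = np.fft.fftshift(amp_src, axes=(0, 1))
    amp_tgt = np.fft.fftshift(amp_tgt, axes=(0, 1))
    h, w = src.shape[:2]
    b = max(1, int(min(h, w) * beta))
    ch, cw = h // 2, w // 2
    amp_src[ch - b:ch + b, cw - b:cw + b] = amp_tgt[ch - b:ch + b, cw - b:cw + b]
    amp_src = np.fft.ifftshift(amp_src, axes=(0, 1))
    out = np.fft.ifft2(amp_src * np.exp(1j * phase_src), axes=(0, 1))
    return np.real(out)
```

Swapping an image's spectrum with itself recovers the original image, which makes the operation easy to sanity-check.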
We use the following hyper-parameters: As in , we set to and to . We then tested different values of in Eq. (10). We set it to for target labels sampled from ground truth annotation and to for pseudo labels as mentioned in Sec. 3.4.
The temperature parameter in Eq. (7) is set to during training and the patch size for the contrastive loss is set to pixels in image space. We use and as thresholds to define positive and negative patch pairs. Our model is trained using SGD with initial learning rate and adjusted according to the ‘poly’ learning rate scheduler with a power of and weight decay , following .
4 Experiments

We evaluate our proposed method on both unsupervised (UDA) and semi-supervised (SSDA) domain adaptation. ASS  is the only semantic segmentation work we know of that operates in the SSDA setting, i.e., it assumes full annotation in the source domain and partial annotations in the target domain. For a fair comparison against other state-of-the-art methods, we extend the following baselines to handle both UDA and SSDA setups:
FDA  translates source domain images to the target style by swapping low-frequency values in Fourier space and leveraging self-training to refine the estimation.
MinEnt  minimizes an entropy loss to penalize low-confidence predictions in the target domain.
AdvEnt  minimizes the same entropy loss as MinEnt and also performs structure adaptation from source to target domain in an adversarial setting.
Universal  introduces a pixel-level entropy regularization scheme to perform feature alignment among multiple domains.
The first three were designed for UDA while the fourth performs SSDA but assumes partial annotations in both domains.
To also test FDA, MinEnt, and AdvEnt in a semi-supervised context, we modified them to also leverage annotated target domain data. To this end, we train them by minimizing their original objective function along with the loss of Eq. (3) for supervised target domain data. For Universal, we used all the source domain annotations, replaced the original network by the same DeepLabV2  network as for all the other baselines, and kept the rest of the method as in the original work, which boosted its performance and made the comparison fair.
[Fig. 5 columns: (a) Image, (b) GT, (c) FDA, (d) OURS]
(a) CityScapes images. (b) Ground-truth semantic segmentations. (c) FDA results given 50 labeled target domain images. (d) Our results given the same 50 labeled target domain images. Our semantic maps tend to be smoother and to preserve the scene structure better. Note, for example, the building in the first and fourth rows; the road in the second, third, and fourth rows; and the sidewalk in the fifth row.
To compare these baselines against our approach, we use one real-world dataset and two synthetic ones and follow the same protocols as in these earlier methods for a fair comparison. They are:
GTA5 contains synthetic images captured from a video game. As in [48, 49], we resize the images and randomly crop them during training. The original dataset features many categories of pixel-wise semantic labels, but we only use the classes that are shared with CityScapes, as in our baselines [43, 45, 49].
We use CityScapes as the target domain and either GTA5 or SYNTHIA as the source domain. We refer to the two resulting tasks as GTA5→CityScapes and SYNTHIA→CityScapes. We use all the labels from the source domain and either none for the target domain or only a handful, which we define as the unsupervised and semi-supervised cases, respectively.
[Table 1: mIoU as a function of the number of labeled target-domain images]
4.3 Comparative Results
In this section, we compare our results against the baselines and report them in Tables 1 and 2 as a function of the number of annotated images in the target domain. Zero annotated images corresponds to the UDA setting, while the non-zero counts denote SSDA, as in . We provide qualitative results in Fig. 5.
Our approach outperforms others in overall mIoU and in most individual categories. ASS  delivers the best performance in some categories but still does worse than our proposed method overall. Note that the performance gap is highest in the SSDA setting where we use only a small number of annotated target domain samples, such as 50. This is significant for many practical applications: Annotating 50 images is typically possible and therefore worth doing given the boost it provides.
Moreover, the ASS approach is complementary to ours and both could be used jointly. The adversarial loss of ASS could be used as an extra loss term in our approach, which is something worth exploring in future work.
4.4 Further Weakening the Annotations
To further drive down the annotation cost and increase its practical appeal, we not only restrict the number of annotated images in the target domain but also perform only partial annotations within these images.
To this end, we split the 200 labeled target domain images into pixel blocks as in  and, instead of annotating all of them, we randomly annotate only a subset and fill the others with pseudo labels.
We then use our approach as described above. We report the results in Tab. 2(a) as a function of the annotated percentage of each one of the 200 target domain images we use for this purpose. As can be seen in the table, we can substantially cut the annotation cost in the target domain with only a slight performance drop. Interestingly, comparing Tab. 2(a) and Tab. 1 shows that partially annotating the blocks in 200 images delivers much better performance than fully annotating 50 images, while representing about the same annotation effort.
4.5 Ablation Study
Quality of Pseudo Labels.
To analyze how the quality of the pseudo labels affects our contrastive loss, we evaluate it using pseudo labels of different quality. Let the pseudo labels generated by models trained with increasing numbers of annotated target domain images be the low, medium, and high quality ones, respectively. We then use these labels to compute the contrastive loss, but compute the other losses as if we had 200 annotated target domain images. As shown in Tab. 2(b), the high quality labels give the best results, which confirms their importance.
To highlight that our approach is easier to train than methods based on adversarial learning, we compare the training loss curves of OURS and AdvEnt. As shown in Fig. 6, our training curve is much better behaved.
Hyper-Parameters and Design Choices.
Finally, we demonstrate the impact of various hyper-parameter and design choices on our model, using a fixed number of labeled real images.
Tables 2(c), 2(d) and 2(e) show the influence of hyper-parameters specific to the contrastive loss: the temperature parameter, the thresholds that define positive and negative pairs of patches, and the patch size itself.
In Tab. 2(f), we compare the pyramid matching algorithm of Section 3.3.2 against a simplified Exact matching scheme that computes the Hamming distance between the two flattened label patches. As expected, our more sophisticated scheme delivers better results.
In Tab. 2(g), we analyze the weighting of the contrastive loss terms applied to real and pseudo ground truth, respectively. Even though both ground-truth and pseudo labels are useful, it is advisable to use higher weights in Eq. (10) for the ground-truth labels than for the pseudo labels.
We analyze the impact of the individual loss terms of our model in Tab. 2(h) with and without using FDA-adapted input images. All the proposed loss terms improve the performance in both cases. Interestingly, the gain of using FDA becomes smaller when adding the contrastive loss, indicating its potential for domain adaptation.
Finally, we compare our training strategy and use of the contrastive loss against a pre-training strategy akin to that of . In Tab. 2(i), OURS-PRE is similar to OURS but uses the contrastive loss only for model pre-training, as opposed to using it jointly with the semantic segmentation loss during the training phase as we normally do. As the table shows, this is less effective.
5 Conclusion

We introduced a new domain adaptation algorithm for semantic segmentation. Our main contribution is a novel patch-wise contrastive loss that aligns image sub-regions across domains when they exhibit similar structures in label space. It enables our algorithm to outperform state-of-the-art methods both in unsupervised and semi-supervised scenarios.
We have shown that our approach naturally extends to the weakly-supervised case in which we annotate the images only partially. In future work, we will leverage this ability to implement an active learning scheme in which the image blocks to be annotated are chosen automatically. This should result in an even lower-cost and highly practical semi-supervised approach.
Acknowledgments This work was completed during an internship at Amazon Prime Air and supported in part by the Swiss National Science Foundation.
References

- (2005) Towards Ultimate Motion Estimation: Combining Highest Accuracy with Real-Time Performance. In ICCV.
- (2020) Contrastive Learning of Global and Local Features for Medical Image Segmentation with Limited Annotations. In NeurIPS.
- (2017) DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. PAMI.
- (2019) Domain Adaptation for Semantic Segmentation with Maximum Squares Loss. In ICCV.
- (2020) A Simple Framework for Contrastive Learning of Visual Representations. In ICML.
- (2018) ROAD: Reality Oriented Adaptation for Semantic Segmentation of Urban Scenes. In CVPR.
- (2017) No More Discrimination: Cross City Adaptation of Road Scene Segmenters. In ICCV.
- (2016) The Cityscapes Dataset for Semantic Urban Scene Understanding. In CVPR.
- (2017) Multi-task Self-Supervised Visual Learning. In ICCV.
- (2014) Discriminative Unsupervised Feature Learning with Convolutional Neural Networks. In NIPS.
- (2020) Contrastive Learning for Weakly Supervised Phrase Grounding. In ECCV.
- (2006) Dimensionality Reduction by Learning an Invariant Mapping. In CVPR.
- (2020) Momentum Contrast for Unsupervised Visual Representation Learning. In CVPR.
- (2016) Deep Residual Learning for Image Recognition. In CVPR.
- (2016) FCNs in the Wild: Pixel-level Adversarial and Constraint-based Adaptation. arXiv:1612.02649.
- Conditional Generative Adversarial Network for Structured Domain Adaptation. In CVPR.
- (2020) Contextual-Relation Consistent Domain Adaptation for Semantic Segmentation. In ECCV.
- (2018) CyCADA: Cycle-Consistent Adversarial Domain Adaptation. In ICML.
- (2019) Invariant Information Clustering for Unsupervised Image Classification and Segmentation. In ICCV.
- (2019) Universal Semi-Supervised Semantic Segmentation. In ICCV.
- (2020) Supervised Contrastive Learning. In NeurIPS.
- (2020) Attract, Perturb, and Explore: Learning a Feature Alignment Network for Semi-supervised Domain Adaptation. In ECCV.
- (2006) Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In CVPR, Vol. 2, pp. 2169–2178.
- (2020) Content-Consistent Matching for Domain Adaptive Semantic Segmentation. In ECCV.
- (2020) Context-Aware Group Captioning via Self-Attention and Contrastive Features. In CVPR.
- (2019) Constructing Self-motivated Pyramid Curriculums for Cross-Domain Semantic Segmentation: A Non-Adversarial Approach. In ICCV.
- (2019) Block Annotation: Better Image Annotation with Sub-Image Decomposition. In ICCV, pp. 5289–5299.
- (2020) Cross-Domain Semantic Segmentation via Domain-Invariant Interactive Relation Transfer. In CVPR.
- (2020) Self-Supervised Learning of Pretext-Invariant Representations. In CVPR.
- (2020) Unsupervised Intra-domain Adaptation for Semantic Segmentation through Self-Supervision. In CVPR.
- (2020) Contrastive Learning for Unpaired Image-to-Image Translation. In ECCV.
- (2020) Domain Adaptive Semantic Segmentation Using Weak Labels. In ECCV.
- (2016) Playing for Data: Ground Truth from Computer Games. In ECCV.
- (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI, pp. 234–241.
- (2016) The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In CVPR.
- (2019) Beyond Sharing Weights for Deep Domain Adaptation. PAMI 41(4), pp. 801–814.
- (2019) Semi-supervised Domain Adaptation via Minimax Entropy. In ICCV.
- (2018) Adversarial Dropout Regularization. In ICLR.
- (2018) Learning from Synthetic Data: Addressing Domain Shift for Semantic Segmentation. In CVPR.
- (2020) Learning from Scale-Invariant Examples for Domain Adaptation in Semantic Segmentation. In ECCV.
- (2020) Contrastive Multiview Coding. In ECCV.
- (2018) Learning to Adapt Structured Output Space for Semantic Segmentation. In CVPR.
- (2019) ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation. In CVPR.
- (2019) DADA: Depth-Aware Domain Adaptation in Semantic Segmentation. In ICCV.
- (2020) Alleviating Semantic-level Shift: A Semi-supervised Domain Adaptation Method for Semantic Segmentation. In CVPRW.
- (2018) DCAN: Dual Channel-wise Alignment Networks for Unsupervised Scene Adaptation. In ECCV.
- (2018) Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination. In CVPR.
- (2020) Phase Consistent Ecological Domain Adaptation. In CVPR.
- (2020) FDA: Fourier Domain Adaptation for Semantic Segmentation. In CVPR.
- (2019) Unsupervised Embedding Learning via Invariant and Spreading Instance Feature. In CVPR.
- (2018) Penalizing Top Performers: Conservative Loss for Semantic Segmentation Adaptation. In ECCV.
- (2019) Local Aggregation for Unsupervised Learning of Visual Embeddings. In ICCV.
- (2018) Domain Adaptation for Semantic Segmentation via Class-Balanced Self-Training. In ECCV.