Domain Adaptation for Semantic Segmentation via Patch-Wise Contrastive Learning

04/22/2021 ∙ by Weizhe Liu, et al.

We introduce a novel approach to unsupervised and semi-supervised domain adaptation for semantic segmentation. Unlike many earlier methods that rely on adversarial learning for feature alignment, we leverage contrastive learning to bridge the domain gap by aligning the features of structurally similar label patches across domains. As a result, the networks are easier to train and deliver better performance. Our approach consistently outperforms state-of-the-art unsupervised and semi-supervised methods on two challenging domain adaptive segmentation tasks, particularly with a small number of target domain annotations. It can also be naturally extended to weakly-supervised domain adaptation, where only a minor drop in accuracy can save up to 75% of the annotation cost.




1 Introduction

Given large amounts of annotated training data, current fully-supervised semantic segmentation algorithms deliver outstanding results. Because annotating many images at the pixel-level is expensive, a common practice is to generate synthetic data and to rely on unsupervised domain adaptation to bridge the gap from the synthetic source to the real-world target domain.

In this paper, we introduce a novel approach to aligning cross-domain features for both unsupervised and semi-supervised domain adaptation. In the first case, no target domain labels are given, whereas in the second one only a small amount of annotated data is available in the target domain. In practice, this second scenario is important because even a handful of target domain labels can boost the performance significantly.

Figure 1: Patch-wise structural disparity. Patches from both domains are divided into sub-patches at several resolution levels. We then compare their label space disparity at all levels and use contrastive learning to make similar patches have similar feature distributions.

Our key insight is that patches from both domains that are structurally similar in label space should also have similar distributions in feature space. To enforce this, which goes beyond what can be done using ordinary pixel-wise similarity, we introduce a patch-wise metric that measures label disparity at several levels of resolution. We pair it with a contrastive loss [5] that, when minimized, pulls the feature distributions of patches with similar structures in label space closer together and pushes those of dissimilar patches apart (see Fig. 1).

To perform unsupervised or semi-supervised training, we incorporate unlabeled pixels into training by using pseudo labels that we iteratively generate using the output of the partially trained network. At each training iteration, we use our patch-wise metric to decrease the feature space disparity of patches that are structurally similar and to increase it for those that are not. This yields a more straightforward approach than the adversarial learning often used previously for cross-domain feature alignment: we do not require an extra discriminator network and therefore avoid the sometimes substantial difficulty of training it.

Our experiments show that our approach delivers a performance increase over state-of-the-art methods in the unsupervised regime [43, 49] and an even larger boost in the semi-supervised one [45, 20]. In a practical setting, we believe the latter to be particularly significant because, while it is very tedious to get huge amounts of annotated frames, it is almost always possible to supply a few annotated frames. Our contribution is therefore a new contrastive learning approach to aligning features across different domains for semantic segmentation. It relies on structural label disparity instead of adversarial learning and outperforms state-of-the-art methods for unsupervised domain adaptation and even more for semi-supervised domain adaptation. We also show how our approach can be extended to weakly-supervised domain adaptation, where only a minor drop in accuracy can save up to 75% of the annotation cost.

2 Related Work

Most recent works on domain adaptation focus on unsupervised methods, with only very few incorporating the limited amount of supervision we advocate. We briefly review these two classes of approaches and then discuss contrastive learning, which is a central component of our approach.

Unsupervised Domain Adaptation (UDA) for Semantic Segmentation

aims to align the source and target domain feature distributions given annotated data only in the source domain. A popular approach is to leverage adversarial learning to generate domain-invariant features. This trend started with [15] and was extended to different levels of representation, including feature space [6, 16, 32, 38] and label space [7, 17, 30, 42, 43, 44]. Notable extensions are [4, 7, 32] which enforce class-wise alignment to narrow the distributions to be matched. However, all these adversarial learning approaches rely on extra discriminator networks which are complicated and hard to train jointly with the generator network.

Other widely used UDA approaches to semantic segmentation include generating realistic-looking synthetic images [18, 39, 46, 48, 49, 51], using pseudo labels for self-training [24, 26, 40, 49, 53], and leveraging weak labels [28, 32]. Among these methods, FDA [49] is a simple approach which achieves state-of-the-art performance by generating realistic-looking synthetic images directly in Fourier space and self-training.

Our approach builds on FDA and incorporates a novel domain-wise contrastive loss, without the reliance on complex adversarial learning.

Semi-Supervised Domain Adaptation for Semantic Segmentation (SSDA)

assumes that a handful of target domain labels are available. In the context of semantic segmentation it has not received as much attention as UDA. For example, [37] achieves adaptation by alternately maximizing the conditional entropy of unlabeled target data with respect to the classifier and minimizing it with respect to the feature encoder. In [36], a two-stream architecture is proposed, where one stream operates in the source and the other in the target domain. In contrast to others, the weights in corresponding layers are related but not shared and optimized to deliver good performance in both domains. In [22], target domain samples are perturbed to reduce intra-domain discrepancy using adversarial learning.

However, none of these works have been demonstrated for semantic segmentation, nor do they leverage contrastive learning for SSDA. The only ones that do are [45] and [20]. In [45], class-wise adversarial learning is used to promote the similarity of pixel-level feature representations for the same classes. In [20], a pixel-level entropy regularization scheme is introduced to favor feature alignment among multiple domains. Therefore, domain alignment is only enforced on the pixel-level, whereas ours is done at a more semantic level.

Contrastive Learning (CL)

aims to learn visual representations by leveraging both similar and dissimilar samples. Early work [12] showcased improved visual representation by contrasting positive pairs against negative ones. Deep CL has been used extensively [10, 47, 52, 41, 13, 29, 5, 25, 9, 50, 19, 21] for applications such as image classification, image-to-image translation [31] or phrase grounding [11].

In the context of semantic segmentation, CL has also been used for intra-domain model pre-training [2]. The approach is specifically designed for use in conjunction with a U-Net [34] backbone and does not generalize well to other architectures. In our experiments, we show that using CL directly as a loss term achieves superior results compared to using CL only for model pre-training.

3 Approach

We first formalize the problem of domain adaptation in Sec. 3.1. We then give an overview of our proposed model in Sec. 3.2 and explain how contrastive learning is used for domain adaptation in Sec. 3.3 and 3.4.

3.1 Problem Formulation

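Let $\mathcal{D}_s = \{(x_s, y_s)\}$ be a source-domain dataset, where $x_s$ denotes a color image and $y_s$ the corresponding semantic map. The target-domain dataset is split into two sets. The first is a labeled set $\mathcal{D}_t = \{(x_t, y_t)\}$ with ground truth semantic maps. The second is the unlabeled set $\mathcal{D}_u = \{x_u\}$ without ground truth semantic labels. We have $\mathcal{D}_t = \emptyset$ in the Unsupervised Domain Adaptation (UDA) setting and $\mathcal{D}_t \neq \emptyset$ for Semi-Supervised Domain Adaptation (SSDA). In most real-world scenarios, we have $|\mathcal{D}_t| \ll |\mathcal{D}_s|$ and $|\mathcal{D}_t| \ll |\mathcal{D}_u|$. Given the three data sets $\mathcal{D}_s$, $\mathcal{D}_t$ and $\mathcal{D}_u$, the task is to learn a single model that performs well on previously unseen target domain data.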

Figure 2: Model Architecture. Images from both domains are fed into the same deep network. Features after each convolutional block of ResNet101 are projected into the latent space by an MLP, where the contrastive loss is employed. The contrastive loss minimizes the latent-space distance between the query patch and the positive patch, while maximizing it between the query patch and the negative patches.

3.2 Overview of Our Approach

Our model consists of an encoder $E$ that maps an input image $x$ to a list of feature vectors $f$ and two decoders $D_z$ and $D_y$ that map these intermediate features to a list of latent vectors $z$ and a semantic map $\hat{y}$. Let $F_y = D_y \circ E$ and $F_z = D_z \circ E$ be the networks that take as input an image and return a semantic map and a latent vector respectively. Fig. 2 gives an overview of our network architecture and the loss functions, which we describe below.

Patch-wise representation:

In practice, the functions $F_y$ and $F_z$ are convolutional neural networks, but for the sake of describing our contrastive loss in Sec. 3.3, a patch-wise representation is useful. Therefore, each image is decoded into $N$ latent vectors, corresponding to rectangular patches $p_i$ for $i = 1, \dots, N$, and the encoder preserves the association between features and patches so that we can write

$$f_i = E(x, p_i), \qquad z_i = D_z(f_i), \tag{1}$$

where $f_i$ and $z_i$ are local features and latent vectors associated to patch $p_i$. In the remainder of the paper, we will denote by $p_s$, $p_t$ and $p_u$ patches of the source, labeled and unlabeled target images.

Loss functions for classification:

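Training all parameters of our model involves minimizing several loss functions, including the novel contrastive loss that we describe in Sec. 3.3. For the per-pixel classification task, we also use supervised cross-entropy losses for the source domain images and, if available, labeled target domain images, respectively, which are defined as

$$\mathcal{L}_s = -\sum_{(x_s, y_s) \in \mathcal{D}_s} \langle y_s, \log F_y(x_s) \rangle, \tag{2}$$

$$\mathcal{L}_t = -\sum_{(x_t, y_t) \in \mathcal{D}_t} \langle y_t, \log F_y(x_t) \rangle, \tag{3}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product over all pixels and classes. Additionally, we employ a regularization loss

$$\mathcal{L}_{ent} = \sum_{x_u \in \mathcal{D}_u} \rho\big(-\langle F_y(x_u), \log F_y(x_u) \rangle\big), \tag{4}$$

where $\rho$ is the Charbonnier penalty function [1]. As in [49], $\mathcal{L}_{ent}$ penalizes uncertainty in the predictions for the unlabeled samples and encourages one label to dominate over the others.

We summarize these three losses into our base loss

$$\mathcal{L}_{base} = \mathcal{L}_s + \mathcal{L}_t + \lambda_{ent} \mathcal{L}_{ent}, \tag{5}$$

with $\lambda_{ent}$ being a small scaling factor.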
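To make the entropy regularizer more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the penalty takes the form (x^2 + eps^2)^eta, and the names `charbonnier`, `pixel_entropy` and `entropy_loss` as well as the values of `eta` and `eps` are purely illustrative.

```python
import math

def charbonnier(x, eta=2.0, eps=1e-3):
    # Assumed form of the penalty: (x^2 + eps^2)^eta; eta and eps are illustrative.
    return (x * x + eps * eps) ** eta

def pixel_entropy(probs):
    # Shannon entropy of one pixel's class distribution; 0 * log(0) is treated as 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_loss(prob_map, eta=2.0):
    # Mean penalised entropy over a flat list of per-pixel class distributions.
    return sum(charbonnier(pixel_entropy(p), eta) for p in prob_map) / len(prob_map)

confident = [[0.98, 0.01, 0.01]] * 4      # one class dominates
uncertain = [[1 / 3, 1 / 3, 1 / 3]] * 4   # maximally uncertain
print(entropy_loss(confident), entropy_loss(uncertain))
```

The confident predictions receive a much smaller loss than the uniform ones, which is the behaviour the regularizer is meant to encourage.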

Pseudo Labels:

After some initial training steps, we can use $F_y$ to assign pseudo labels to unlabeled images and to compute an additional cross-entropy loss term

$$\mathcal{L}_{pl} = -\sum_{x_u \in \mathcal{D}_u} \langle \tilde{y}_u, \log F_y(x_u) \rangle, \tag{6}$$

where $\tilde{y}_u$ are the pseudo labels. In Sec. 4.5, we demonstrate that pseudo labeling becomes more effective in the SSDA setting, as compared to an unsupervised setting, because pseudo labels are more reliable.
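The pseudo-labeling step can be sketched as follows. This is an illustration under assumptions that the text does not confirm: pixels are kept only if the top class probability exceeds a hypothetical threshold `min_conf`, and rejected pixels receive a hypothetical `IGNORE` id that is skipped by the loss.

```python
import math

IGNORE = 255  # hypothetical "no label" id, excluded from the loss

def pseudo_labels(prob_map, min_conf=0.9):
    # One label per pixel: the argmax class if it is confident enough, else IGNORE.
    labels = []
    for probs in prob_map:
        best = max(range(len(probs)), key=lambda k: probs[k])
        labels.append(best if probs[best] >= min_conf else IGNORE)
    return labels

def cross_entropy(prob_map, labels):
    # Mean cross-entropy over the pixels whose label is not IGNORE.
    terms = [-math.log(p[y]) for p, y in zip(prob_map, labels) if y != IGNORE]
    return sum(terms) / len(terms) if terms else 0.0

probs = [[0.95, 0.03, 0.02], [0.40, 0.35, 0.25], [0.05, 0.92, 0.03]]
labels = pseudo_labels(probs)
print(labels, cross_entropy(probs, labels))
```

In this toy example the second pixel is too uncertain and is ignored, so only the first and third pixels contribute to the loss.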

3.3 Contrastive Learning for Domain Adaptation

The main contribution of this paper is to leverage contrastive learning for domain adaptation. Instead of relying on an adversarial training scheme, as most prior works do [45, 43], we use a contrastive loss on pairs of patches from different domains. The goal is to bring the representation of positive pairs closer together, while pushing negative pairs apart. A benefit of our approach is that optimizing this loss is simpler than adversarial learning, which involves a min-max optimization scheme. However, the main challenge is to define positive and negative pairs of patches across domains for the contrastive loss.

3.3.1 Matching of Patches for Contrastive Learning

The key idea of our approach is that if two patches (one from the source domain and the other from the target domain) are semantically similar, then their embedding in latent space should also be similar. Conversely, if two patches are semantically dissimilar, their embeddings should also have a large distance.

To find such pairs, let us assume a semantic disparity function $d(\cdot, \cdot)$, which we define formally in the following section. Patches with high semantic similarity have low disparity values $d$, and vice-versa. We sample a patch in an image of one domain and compute the disparity to all patches in an image from the other domain. Pairs of patches with low disparity score are considered positive ($d < \epsilon_{pos}$) and pairs with high disparity negative ($d > \epsilon_{neg}$). Pairs with disparity values in-between $\epsilon_{pos}$ and $\epsilon_{neg}$ are simply ignored. Fig. 3 gives an example of positive and negative pairs.

To define a contrastive loss with the discovered pairs of patches, let us define the query patch $p^q$ sampled from one domain, the positive patch $p^+$ and $M$ negative patches $p^-_j$ sampled from the other domain. Let $z^q$, $z^+$ and $z^-_j$ be the corresponding latent vectors. We can then define the contrastive loss as

$$\mathcal{L}_{cl} = -\log \frac{s(z^q, z^+)}{s(z^q, z^+) + \sum_{j=1}^{M} s(z^q, z^-_j)}, \tag{7}$$

with $s(\cdot, \cdot)$ being the similarity between any two vectors, defined as the exponential of the cosine similarity normalized by temperature parameter $\tau$:

$$s(u, v) = \exp\left(\frac{u^\top v}{\tau \, \|u\| \, \|v\|}\right).$$
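The pair selection and the contrastive loss described above can be sketched as below. This is a minimal illustration, assuming the common InfoNCE form with one positive and several negatives. The threshold names `eps_pos` and `eps_neg`, the temperature `tau` and all numeric values are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity of two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def sim(u, v, tau=0.1):
    # Exponential of the cosine similarity, scaled by the temperature tau.
    return math.exp(cosine(u, v) / tau)

def split_pairs(disparities, eps_pos, eps_neg):
    # Low disparity -> positive candidates, high disparity -> negatives, rest ignored.
    pos = [i for i, d in enumerate(disparities) if d < eps_pos]
    neg = [i for i, d in enumerate(disparities) if d > eps_neg]
    return pos, neg

def contrastive_loss(z_q, z_pos, z_negs, tau=0.1):
    # InfoNCE-style loss for one query, one positive and a list of negatives.
    s_pos = sim(z_q, z_pos, tau)
    s_neg = sum(sim(z_q, z, tau) for z in z_negs)
    return -math.log(s_pos / (s_pos + s_neg))

disparities = [0.05, 0.4, 0.9, 0.95]   # query patch vs. four patches of the other domain
latents = [[0.9, 0.1], [0.5, 0.5], [-1.0, 0.2], [0.0, 1.0]]
z_query = [1.0, 0.0]
pos, neg = split_pairs(disparities, eps_pos=0.2, eps_neg=0.8)
aligned = contrastive_loss(z_query, latents[pos[0]], [latents[j] for j in neg])
misaligned = contrastive_loss(z_query, latents[3], [latents[0]])
print(pos, neg, aligned, misaligned)
```

The loss is close to zero when the query already lies near its positive and far from its negatives, and it grows when the roles are reversed.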


Figure 3: Example of Semantic Disparity. We visualize the semantic disparity between one label patch from the source domain and three label patches from the target domain. As can be seen, our metric clearly demonstrates both semantic and spatial disparity.

3.3.2 Patch-Wise Semantic Disparity

We need a measure of patch-wise semantic disparity in label space to define positive and negative pairs as described in Sec. 3.3.1. Given a pair of patches, one from source and the other from the target domain, we define a metric accounting for both semantic and structural disparity in label space.

Figure 4: Label Space Spatial Pyramid Matching. A patch is cropped from the full label mask and divided into 3 different levels, from coarse to fine. The semantic information is calculated as the normalized histograms over categories for each sub-patch in all three levels.

Let us consider patch $p$ and semantic map $y$ from either domain. Let $c_k$ be the proportion of pixels in $p$ that have label $k$, for $k = 1, \dots, K$. We take

$$h(p) = [c_1, \dots, c_K] \tag{8}$$

to be the semantic vector of $p$, containing the overall semantic information without any spatial layout information.

To also encode rough spatial information and allow for robust matching, we adopt a simplified version of spatial pyramid matching [23] in label space. We compute the semantic vector via Eq. (8) on three spatial levels, as shown in Fig. 4. We define the patch-wise semantic disparity between $p_s$ and $p_t$, from source and target domains, respectively, as

$$d(p_s, p_t) = \sum_{l=1}^{L} \frac{1}{N_l} \sum_{n=1}^{N_l} \left\| h(p_s^{l,n}) - h(p_t^{l,n}) \right\|, \tag{9}$$

with $L = 3$. That is, we measure distance between semantic vectors at 3 pyramid levels $l$, each with $N_l$ patches. $p_s^{l,n}$ and $p_t^{l,n}$ denote the $n$-th sub-patch of $p_s$ and $p_t$ at level $l$. The first spatial level covers the whole patch, hence $N_1 = 1$.

At the second and third levels, we split the patch into 4 and 16 sub-patches, respectively, and set $N_2 = 4$ and $N_3 = 16$. The coefficient $1/N_l$ in Eq. (9) ensures equal contribution from all levels.
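The pyramid disparity can be illustrated with a small sketch. It assumes square patches whose side is divisible by 4 and uses an L1 distance between histograms, which is an assumption since the text does not pin down the distance. The function names are hypothetical.

```python
def histogram(labels, num_classes):
    # Normalised label histogram of a 2-D list of class ids.
    counts = [0.0] * num_classes
    total = 0
    for row in labels:
        for k in row:
            counts[k] += 1.0
            total += 1
    return [c / total for c in counts]

def sub_patches(patch, splits):
    # Cut a square 2-D patch into splits x splits equally sized sub-patches.
    step = len(patch) // splits
    return [
        [row[c * step:(c + 1) * step] for row in patch[r * step:(r + 1) * step]]
        for r in range(splits) for c in range(splits)
    ]

def disparity(patch_a, patch_b, num_classes):
    total = 0.0
    for splits in (1, 2, 4):  # levels with 1, 4 and 16 sub-patches
        subs_a = sub_patches(patch_a, splits)
        subs_b = sub_patches(patch_b, splits)
        level = 0.0
        for sa, sb in zip(subs_a, subs_b):
            ha, hb = histogram(sa, num_classes), histogram(sb, num_classes)
            level += sum(abs(x - y) for x, y in zip(ha, hb))  # L1 distance: an assumption
        total += level / len(subs_a)  # 1 / N_l keeps the levels balanced
    return total

left_right = [[0, 0, 1, 1] for _ in range(4)]             # class 0 left, class 1 right
top_bottom = [[0] * 4, [0] * 4, [1] * 4, [1] * 4]         # class 0 top, class 1 bottom
print(disparity(left_right, left_right, 2))  # 0.0
print(disparity(left_right, top_bottom, 2))  # 2.0
```

Both toy patches contain the same proportion of each class, so the first level sees no difference. The finer levels pick up the different spatial layout, which is what the pyramid is meant to capture.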

3.4 Training Strategy

We can now put the individual losses Eq. (5), (6) and (7) together to define our overall training objective

$$\mathcal{L} = \mathcal{L}_{base} + \lambda_{pl} \mathcal{L}_{pl} + \lambda_{cl} \mathcal{L}_{cl}, \tag{10}$$

which we minimize with respect to the network parameters. We introduce weighting factors $\lambda_{pl}$ and $\lambda_{cl}$ to balance the impact of individual loss terms.

We first train a network only with $\mathcal{L}_{base}$, which we use to estimate pseudo labels. Then, our network is re-initialized and re-trained with the full loss $\mathcal{L}$. The contrastive loss operates on both labeled target and pseudo-labeled target data, where we use a lower weight for the one operating on pseudo-labeled data. Note again that our training approach does not require adversarial learning objectives for domain adaptation.
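The two-stage schedule can be summarized in a few lines. This is only a sketch of how the terms might be combined: the dictionary keys, the function name `stage_loss` and the weight values are hypothetical, and the only property taken from the text is that the contrastive term on pseudo-labeled data gets a lower weight than the one on labeled target data.

```python
def stage_loss(stage, losses, weights):
    # Stage 1 trains with the base loss only; its predictions yield the pseudo labels.
    if stage == 1:
        return losses["base"]
    # Stage 2 retrains from scratch with all terms.
    return (losses["base"]
            + weights["pl"] * losses["pl"]
            + weights["cl_labeled"] * losses["cl_labeled"]
            + weights["cl_pseudo"] * losses["cl_pseudo"])

losses = {"base": 1.0, "pl": 0.5, "cl_labeled": 0.25, "cl_pseudo": 0.5}
weights = {"pl": 1.0, "cl_labeled": 1.0, "cl_pseudo": 0.5}  # illustrative values only
print(stage_loss(1, losses, weights), stage_loss(2, losses, weights))
```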

3.5 Implementation Details

We use DeepLabV2 [3] as our semantic segmentation network $F_y$, with the encoder $E$ being a ResNet101 [14] backbone. This choice enables a fair comparison with prior works [45, 43, 49]. The decoder network $D_z$, which extracts the latent variables for our contrastive loss, is more sophisticated. It uses an average pooling layer followed by a two-layer perceptron to project the feature patches $f_i$ of the encoder (see Eq. (1)) into the latent space in which the contrastive loss Eq. (7) is computed. We add decoder $D_z$ to multiple intermediate layers as suggested by previous work [31]. Note that $D_z$ is only required at training time.
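As a rough illustration of the projection head, the sketch below average-pools the feature vectors of one patch and passes the result through two linear layers. The ReLU between the layers, the layer sizes and all weights are assumptions made for the example, not details given in the text.

```python
def avg_pool(features):
    # Average a list of per-pixel feature vectors into one patch descriptor.
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def linear(x, weights, bias):
    # Dense layer: one output per weight row.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def project(features, w1, b1, w2, b2):
    # Pool, then apply a two-layer perceptron (ReLU in between is an assumption).
    hidden = [max(0.0, h) for h in linear(avg_pool(features), w1, b1)]
    return linear(hidden, w2, b2)

patch_features = [[1.0, 3.0], [3.0, 5.0]]   # two pixels, two channels
w1, b1 = [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]
w2, b2 = [[0.5, 1.0]], [1.0]
print(project(patch_features, w1, b1, w2, b2))  # [2.0]
```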

To improve our overall adaptation quality, we employ the recently proposed Fourier Domain Adaptation (FDA) method [49]. It translates source domain images to the target domain by swapping the low-frequency component of the spectrum of the source image with that of a randomly selected target one. The strength of this approach is that the translation is very simple as it happens directly in the image space without any deep neural network. Thus, the source domain images we use in Eq. (2) are translated using FDA.
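The low-frequency swap can be sketched on tiny single-channel images. The code below assumes, as we understand [49], that only the amplitude of the low frequencies is replaced while the source phase is kept. It uses a naive discrete Fourier transform to stay self-contained, whereas a practical implementation would rely on an FFT library. The `radius` parameter and the toy images are hypothetical.

```python
import cmath

def dft2(img, inverse=False):
    # Naive 2-D DFT of a list-of-lists image (O(N^4), for tiny examples only).
    h, w = len(img), len(img[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = sign * 2 * cmath.pi * (u * y / h + v * x / w)
                    acc += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = acc / (h * w) if inverse else acc
    return out

def fda_translate(source, target, radius=1):
    # Give `source` the low-frequency amplitudes of `target`, keeping the source phase.
    h, w = len(source), len(source[0])
    fs, ft = dft2(source), dft2(target)
    for u in range(h):
        for v in range(w):
            # Frequencies wrap around, so indices u and h - u are equally "low".
            if min(u, h - u) <= radius and min(v, w - v) <= radius:
                fs[u][v] = cmath.rect(abs(ft[u][v]), cmath.phase(fs[u][v]))
    return [[z.real for z in row] for row in dft2(fs, inverse=True)]

source = [[0.1, 0.2, 0.1, 0.2],
          [0.2, 0.3, 0.2, 0.3],
          [0.1, 0.2, 0.1, 0.2],
          [0.2, 0.3, 0.2, 0.3]]
target = [[0.8, 0.7, 0.9, 0.6],
          [0.7, 0.9, 0.6, 0.8],
          [0.9, 0.6, 0.8, 0.7],
          [0.6, 0.8, 0.7, 0.9]]
translated = fda_translate(source, target)
print(sum(map(sum, translated)) / 16)
```

Because the zero-frequency term is part of the swapped band, the translated image takes on the mean brightness of the target while keeping the layout of the source.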

We use the following hyper-parameters: As in [49], we set to and to . We then tested different values of in Eq. (10). We set it to for target labels sampled from ground truth annotation and to for pseudo labels as mentioned in Sec. 3.4.

The temperature parameter in Eq. (7) is set to during training and the patch size for the contrastive loss is set to pixels in image space. We use and as thresholds to define positive and negative patch pairs. Our model is trained using SGD with initial learning rate and adjusted according to the ‘poly’ learning rate scheduler with a power of and weight decay , following [49].
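The 'poly' schedule mentioned above is commonly written as a base rate multiplied by (1 - step / max_steps) raised to a power. The following sketch assumes that form. The function name and every numeric value in the example are hypothetical and are not the settings used in the paper.

```python
def poly_lr(base_lr, step, max_steps, power):
    # Learning rate decays from base_lr at step 0 to 0 at max_steps.
    return base_lr * (1.0 - step / max_steps) ** power

# Illustrative values only.
for step in (0, 50, 100):
    print(step, poly_lr(0.1, step, 100, 2.0))
```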

4 Experiments

4.1 Baselines

We evaluate our proposed method on both unsupervised (UDA) and semi-supervised (SSDA) domain adaptation. ASS [45] is the only semantic segmentation work we know of that operates in the SSDA setting, i.e., it assumes full annotation in the source domain and partial annotations in the target domain. For a fair comparison against other state-of-the-art methods, we extend the following baselines to handle both UDA and SSDA setups:

  • FDA [49] translates source domain images to the target style by swapping low-frequency values in Fourier space and leveraging self-training to refine the estimation.

  • MinEnt [43] minimizes an entropy loss to penalize low-confidence predictions in the target domain.

  • AdvEnt [43] minimizes the same entropy loss as MinEnt and also performs structure adaptation from source to target domain in an adversarial setting.

  • Universal [20] introduces a pixel-level entropy regularization scheme to perform feature alignment among multiple domains.

The first three were designed for UDA while the fourth performs SSDA but assumes partial annotations in both domains.

To also test FDA, MinEnt, and AdvEnt in a semi-supervised context we modified them to also leverage annotated target domain data. To this end, we train them by minimizing their original objective function along with the loss from Eq. (3) for supervised target domain data. For Universal, we used all the source domain annotations, replaced the original network by the same DeepLabV2 [3] network as for all the other baselines and kept the rest of the method as in the original work, which boosted its performance and makes the comparison fair.

Figure 5: Qualitative results. (a) CityScapes images. (b) Ground-truth semantic segmentations. (c) FDA [49] results given 50 labeled target domain images. (d) Our results given the same 50 labeled target domain images. Our semantic maps tend to be smoother and to preserve the scene structure better. Note, for example, the building in the first and fourth rows; the road in the second, third, and fourth rows; and the sidewalk in the fifth row.

4.2 Datasets

To compare these baselines against our approach, we use one real-world dataset and two synthetic ones and follow the same protocols as in these earlier methods for a fair comparison. They are:

CityScapes [8]

is a real-world benchmark dataset for semantic segmentation featuring driving scenes and dense manual annotations. We use the images from the training set as the target domain for training. We test on the validation images and resize the images to without any random cropping, as in [48, 49].

GTA5 [33]

contains synthetic images captured from a video game. As in [48, 49], we resize the images to and randomly crop them to during training. The original dataset features different categories of pixel-wise semantic labels but we only use the classes that are shared with CityScapes, as do our baselines [43, 45, 49].

Synthia [35]

is also a synthetic dataset. As others [43, 45, 49], we use the SYNTHIA-RAND-CITYSCAPES subset which comprises annotated images. During training, we also randomly crop the image to and evaluate with classes to follow the standard protocol.

We use CityScapes as the target domain and either GTA5 or SYNTHIA as the source domain. We refer to the two resulting tasks as GTA5→CityScapes and SYNTHIA→CityScapes. We use all the labels from the source domain and either none for the target domain or only a handful, which we define as the unsupervised and semi-supervised cases, respectively.

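Method road sidewalk building wall fence pole light sign vegetation terrain sky person rider car truck bus train motorcycle bicycle mIoU
# labeled: 0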
MinEnt [43] 84.4 18.7 80.6 23.8 23.2 28.4 36.9 23.4 83.2 25.2 79.4 59.0 29.9 78.5 33.7 29.6 1.7 29.9 33.6 42.30
AdvEnt [43] 89.9 36.5 81.6 29.2 25.2 28.5 32.3 22.4 83.9 34.0 77.1 57.4 27.9 83.7 29.4 39.1 1.5 28.4 23.3 43.80
FDA [49] 92.5 53.3 82.4 26.5 27.6 36.4 40.6 38.9 82.3 39.8 78.0 62.6 34.4 84.9 34.1 53.1 16.9 27.7 46.4 50.45
OURS 93.3 54.2 83.0 25.9 28.1 37.2 41.1 39.3 83.1 38.9 78.2 61.3 36.2 84.2 35.8 54.0 18.1 26.7 47.5 50.85
# labeled: 50
MinEnt [43] 92.6 56.8 82.5 27.5 27.0 36.6 29.7 36.8 84.6 36.7 78.2 60.5 19.4 81.1 36.4 40.0 4.6 19.4 52.3 47.51
AdvEnt [43] 90.1 43.6 81.0 23.2 27.4 34.9 27.5 28.0 84.6 26.6 75.7 60.9 24.1 76.8 39.1 42.7 8.3 17.6 41.9 44.94
FDA [49] 93.7 59.6 84.3 31.7 31.0 37.0 36.5 42.1 86.5 44.8 80.9 63.7 31.9 86.7 47.2 53.2 13.2 33.8 50.4 53.10
ASS [45] 94.3 63.0 84.5 26.8 28.0 38.4 35.5 48.7 87.1 39.2 88.8 62.2 16.3 87.6 23.2 39.2 7.2 24.4 58.1 50.10
Universal [20] 91.2 44.1 82.4 25.0 29.2 36.1 29.0 32.5 81.9 29.4 77.1 64.0 30.2 73.5 45.0 45.9 11.2 18.2 44.6 46.87
OURS 94.4 63.6 85.3 26.4 31.3 40.3 41.5 53.0 87.1 43.4 85.4 62.9 30.1 88.0 48.3 55.2 8.3 34.0 51.0 54.17
# labeled: 100
MinEnt [43] 93.0 56.3 83.2 27.5 26.0 37.1 31.9 40.4 85.2 42.5 82.3 60.6 27.3 83.7 41.8 45.3 1.6 20.9 45.1 49.02
AdvEnt [43] 91.3 51.0 82.2 23.2 26.1 37.3 32.3 33.0 84.3 34.3 77.0 61.7 30.0 84.0 39.3 42.3 0.3 16.2 45.1 46.89
FDA [49] 94.4 62.2 85.2 32.7 32.6 38.6 39.5 47.6 86.8 48.9 85.1 64.5 36.8 87.5 46.1 53.3 0.7 33.0 52.4 54.10
ASS [45] 96.0 71.7 85.9 27.9 27.6 42.8 44.7 55.9 87.7 46.9 89.0 66.0 36.4 88.4 28.9 21.4 11.4 38.0 63.2 54.20
Universal [20] 92.2 57.4 84.1 28.7 22.8 41.2 33.0 41.7 83.9 45.8 80.7 61.1 29.4 80.8 41.5 42.9 6.2 11.5 49.1 49.16
OURS 94.6 66.1 86.3 26.2 34.9 38.9 40.5 52.9 87.3 49.2 87.9 64.7 36.8 88.6 48.4 45.0 2.8 38.8 58.5 55.17
# labeled: 200
MinEnt [43] 93.9 59.0 84.7 22.4 30.2 38.4 36.2 47.3 86.0 42.4 82.4 62.1 33.0 87.1 44.2 39.0 28.5 20.3 52.3 52.07
AdvEnt [43] 92.7 58.4 84.0 22.4 30.1 36.3 37.3 43.4 85.6 35.7 83.4 61.3 28.0 80.9 39.5 41.6 22.9 19.0 51.7 50.23
FDA [49] 94.7 62.8 85.9 32.9 31.9 38.5 41.9 51.7 86.9 47.5 85.2 64.8 36.7 87.9 48.6 58.2 19.2 37.0 54.9 56.16
ASS [45] 96.1 71.9 85.8 28.4 29.8 42.5 45.0 56.2 87.4 45.0 88.7 65.8 38.2 89.6 42.2 35.9 17.1 35.8 61.6 56.00
Universal [20] 93.1 64.8 82.1 21.5 25.8 39.0 39.8 43.4 82.7 47.9 83.8 63.6 35.8 83.2 49.0 38.1 19.8 24.9 50.4 52.04
OURS 94.2 62.2 85.9 29.0 35.0 41.7 45.7 55.4 87.4 49.1 86.1 65.8 40.2 88.8 47.6 58.1 12.7 40.2 57.2 56.96
# labeled: 500
MinEnt [43] 94.8 65.6 86.0 32.1 37.3 40.7 40.5 53.7 86.7 44.8 86.3 63.2 31.0 86.3 49.8 53.6 13.1 26.7 58.1 55.26
AdvEnt [43] 94.9 67.5 85.7 28.4 36.3 41.3 40.8 53.1 87.3 45.6 84.4 64.2 36.4 87.5 50.7 46.9 13.4 29.2 59.4 55.42
FDA [49] 94.6 64.9 86.6 36.3 38.9 42.6 46.2 56.4 87.8 52.4 85.9 65.9 39.4 89.3 56.4 62.0 23.6 39.3 56.1 59.19
ASS [45] 96.2 72.7 87.6 35.1 31.7 46.6 46.9 62.7 88.7 49.6 90.5 69.2 42.7 91.1 52.6 60.9 9.6 43.1 65.6 60.20
Universal [20] 94.2 72.8 84.9 29.9 38.5 43.5 42.7 55.4 85.8 48.0 83.5 66.1 39.2 86.4 51.8 48.5 16.9 31.5 61.1 56.88
OURS 94.7 66.7 87.4 31.5 42.5 42.2 47.0 58.5 87.7 49.7 88.5 68.1 44.5 89.5 62.3 62.1 19.7 42.6 63.2 60.43
# labeled: 1000
MinEnt [43] 95.5 70.4 86.7 33.3 35.7 41.6 44.4 55.9 87.3 46.1 87.5 64.8 38.3 88.2 45.8 59.4 34.0 34.0 60.6 58.40
AdvEnt [43] 95.4 70.1 86.3 35.9 38.7 40.5 43.7 55.4 87.7 50.5 87.1 65.6 40.6 87.9 56.9 51.0 10.3 36.7 61.1 57.98
FDA [49] 96.0 71.9 87.2 31.7 39.7 44.0 47.5 59.1 88.0 51.1 88.8 69.4 47.8 89.9 63.0 67.2 36.6 45.9 60.3 62.37
ASS [45] 96.8 76.3 88.5 30.5 41.7 46.5 51.3 64.3 89.1 54.2 91.0 70.7 48.7 91.6 59.9 68.0 40.8 48.0 67.0 64.50
Universal [20] 95.2 74.3 87.6 35.9 37.2 43.8 45.2 53.1 85.4 49.8 84.3 68.3 41.0 86.5 48.3 62.2 37.5 38.6 67.5 60.09
OURS 96.1 72.5 88.5 38.9 47.7 45.8 51.6 61.7 88.9 50.9 89.1 71.4 51.0 91.3 68.1 69.4 32.5 46.2 66.4 64.62
Table 1: Results on GTA5→CityScapes. We evaluate on the 19 classes shared between these two datasets. Our method is consistently better than the baselines, particularly with few labeled target domain images.
mIoU on # Labeled
Method 0 50 100 200 500 1000
MinEnt [43] 44.20 52.86 56.41 57.93 62.54 66.04
AdvEnt [43] 47.60 51.41 55.20 59.64 62.58 66.82
FDA [49] 52.50 58.48 62.03 64.40 66.81 70.17
ASS [45] - 60.70 62.10 64.80 69.80 73.00
Universal [20] - 53.63 57.05 60.26 63.41 68.47
OURS 53.26 61.18 63.39 65.23 70.26 73.09
Table 2: Results on SYNTHIA→CityScapes: We report mIoU over classes, following the standard protocol. Note that our method consistently outperforms others. Per-class results can be found in supplemental material.

4.3 Comparative Results

In this section, we compare our results against the baselines and report the results in Tables 1 and 2 as a function of the number of annotated images in the target domain. A value of 0 corresponds to the UDA setting while 50, 100, 200, 500, and 1000 denote SSDA as in [45]. We provide qualitative results in Fig. 5.
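For readers unfamiliar with the metric in Tables 1 and 2, mean intersection-over-union (mIoU) can be computed as in the sketch below. It is a generic illustration on flat label lists, not the evaluation code behind the tables. The `ignore` id is a hypothetical convention for unlabeled ground-truth pixels, and benchmark implementations typically accumulate the counts over the whole validation set before averaging.

```python
def mean_iou(pred, truth, num_classes, ignore=255):
    inter = [0] * num_classes   # true positives per class
    union = [0] * num_classes   # true positives + false positives + false negatives
    for p, t in zip(pred, truth):
        if t == ignore:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1   # false positive for the predicted class
            union[t] += 1   # false negative for the true class
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, truth, num_classes=3))  # 0.5
```

In this toy case the per-class IoUs are 1/3, 2/3 and 1/2, which average to 0.5.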

Our approach outperforms others in overall mIoU and in most individual categories. ASS [45] delivers the best performance in some categories but still does worse than our proposed method overall. Note that the performance gap is highest in the SSDA setting where we use only a small number of annotated target domain samples, such as 50. This is significant for many practical applications: Annotating 50 images is typically possible and therefore worth doing given the boost it provides.

Moreover, the ASS approach is complementary to ours and both could be used jointly. The adversarial loss of ASS could be used as an extra loss term in our approach, which is something worth exploring in future work.

4.4 Further Weakening the Annotations

To further drive down the annotation cost and increase its practical appeal, we not only restrict the number of annotated images in the target domain but also perform only partial annotations within these images.

To this end, we split the 200 labeled target domain images into pixel blocks as in [27] and, instead of annotating all of them, we randomly annotate only a subset and fill the others with pseudo labels.
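The block-wise weak annotation can be sketched as follows. The code keeps ground truth in a random subset of square tiles and falls back to pseudo labels elsewhere. The tile size, the annotated fraction, the seed and the function name `mix_labels` are all hypothetical choices for the example.

```python
import random

def mix_labels(gt, pseudo, block, fraction, seed=0):
    # Keep ground truth in a random `fraction` of block x block tiles, pseudo labels elsewhere.
    h, w = len(gt), len(gt[0])
    tiles = [(r, c) for r in range(0, h, block) for c in range(0, w, block)]
    rng = random.Random(seed)
    annotated = set(rng.sample(tiles, round(fraction * len(tiles))))
    mixed = [row[:] for row in pseudo]
    for r0, c0 in annotated:
        for r in range(r0, min(r0 + block, h)):
            for c in range(c0, min(c0 + block, w)):
                mixed[r][c] = gt[r][c]
    return mixed, annotated

gt = [[1] * 4 for _ in range(4)]       # toy ground truth: class 1 everywhere
pseudo = [[0] * 4 for _ in range(4)]   # toy pseudo labels: class 0 everywhere
mixed, annotated = mix_labels(gt, pseudo, block=2, fraction=0.25)
print(annotated, mixed)
```

With four tiles and a fraction of 0.25, exactly one tile carries ground truth, so four of the sixteen pixels come from the annotation and the rest from pseudo labels.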

(a) Annotation mIoU

(b) Pseudo Label Quality mIoU
Low 56.48
Median 56.96
High 58.05

(e) Patch Size mIoU
32 × 16 53.32
64 × 32 54.17
256 × 128 53.39

(f) Method mIoU
Exact 53.25
Pyramid 54.17

(h) Loss mIoU w/ FDA mIoU w/o FDA
51.93 50.97
53.06 52.86
54.17 54.08

(i) Method mIoU
OURS-PRE 52.04
OURS 54.17
Table 3: Ablation Study on: (a) Weak annotation; (b) Quality of pseudo label; (c) Temperature parameter; (d) Threshold values for patch matching; (e) Patch size; (f) Matching strategy; (g) Weights of contrastive loss; (h) Contribution of each loss term; (i) Contrastive learning.

We then use our approach as described above. We report the results in Tab. 3(a) as a function of the annotated percentage of each one of the 200 target domain images we use for this purpose. As can be seen in the table, we can further cut the annotation cost in the target domain by 75% with only a slight performance drop. Interestingly, comparing Tab. 3(a) and Tab. 1 shows that annotating 25% of the blocks in 200 images delivers much better performance than fully annotating 50 images, while representing about the same annotation effort.

4.5 Ablation Study

Quality of Pseudo Labels.

To analyze how the quality of the pseudo labels affects our contrastive loss, we evaluate it using pseudo labels of different quality. Let the pseudo labels generated by models trained using increasing numbers of annotated target domain images be the low, median, and high quality ones, respectively. We then use these labels to compute the contrastive loss but compute the other losses as if we had 200 annotated target domain images. As shown in Tab. 3(b), the high quality labels give the best results, which confirms their importance.

Training Stability.

To highlight that our approach is easier to train than methods based on adversarial learning, we compare the training loss between OURS and AdvEnt. As shown in Fig. 6, our training curve is much better behaved.

Figure 6: Training curve. Our training curve is much smoother than the one of AdvEnt [43] because our approach does not rely on adversarial learning.
Hyper-Parameters and Design Choices.

Finally, we demonstrate the impact of various hyper-parameters and design choices on our model with 50 labeled real images.

Tables 3(c), 3(d) and 3(e) show the influence of hyper-parameters specific to the contrastive loss: the temperature parameter $\tau$, the thresholds $\epsilon_{pos}$ and $\epsilon_{neg}$ that define positive and negative pairs of patches, and the patch size itself.

In Tab. 3(f), we compare the pyramid matching algorithm of Section 3.3.2 against a simplified Exact matching scheme that computes the Hamming distance between the two flattened label patches. As expected, our more sophisticated scheme delivers better results.
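The Exact matching baseline can be written in a few lines. The sketch below normalizes the Hamming distance by the number of pixels, which is an assumption, and the function name is hypothetical.

```python
def hamming_disparity(patch_a, patch_b):
    # Fraction of pixel positions whose labels differ between two equally sized patches.
    flat_a = [k for row in patch_a for k in row]
    flat_b = [k for row in patch_b for k in row]
    return sum(a != b for a, b in zip(flat_a, flat_b)) / len(flat_a)

checker = [[0, 1], [1, 0]]
shifted = [[1, 0], [0, 1]]   # same pattern moved by one pixel
print(hamming_disparity(checker, checker))  # 0.0
print(hamming_disparity(checker, shifted))  # 1.0
```

The shifted checkerboard has exactly the same class proportions as the original, yet the pixel-wise comparison treats the two patches as maximally different. This sensitivity to small shifts is one plausible reason why a histogram-based pyramid is more forgiving.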

In Tab. 3(g), we analyze the weighting of the contrastive loss terms applied on real and pseudo ground truth, respectively. Even though both ground truth and pseudo labels are useful, using a higher weight in Eq. (10) for the ground truth labels than for the pseudo labels is advisable.

We analyze the impact of the individual loss terms of our model in Tab. 3(h) with and without using FDA-adapted input images. All the proposed loss terms improve the performance in both cases. Interestingly, the gain of using FDA becomes smaller when adding the contrastive loss, indicating its potential for domain adaptation.

Finally, we compare our training strategy and use of the contrastive loss against a pre-training strategy akin to that of [2]. In Tab. 3(i), OURS-PRE is similar to OURS but uses the contrastive loss only for model pre-training as opposed to using it jointly with the semantic segmentation loss term during the training phase as we normally do. As the table shows, this is less effective.

5 Conclusion

We introduced a new domain adaptation algorithm for semantic segmentation. Our main contribution is a novel patch-wise contrastive loss that aligns image sub-regions across domains when they exhibit similar structures in label space. It enables our algorithm to outperform state-of-the-art methods both in unsupervised and semi-supervised scenarios.

We have shown that our approach naturally extends to the weakly-supervised case in which we annotate the images only partially. In future work, we will leverage this ability to implement an active learning scheme in which the image blocks to be annotated are chosen automatically. This should result in an even lower-cost and highly practical semi-supervised approach.

Acknowledgments This work was completed during an internship at Amazon Prime Air and supported in part by the Swiss National Science Foundation.


  • [1] A. Bruhn and J. Weickert (2005) Towards Ultimate Motion Estimation: Combining Highest Accuracy with Real-Time Performance. In iccv, Cited by: §3.2.
  • [2] K. Chaitanya, E. Erdil, N. Karani, and E. Konukoglu (2020) Contrastive learning of global and local features for medical image segmentation with limited annotations. In nips, Cited by: §2, §4.5.
  • [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A.L. Yuille (2017) DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. pami. Cited by: §3.5, §4.1.
  • [4] M. Chen, H. Xue, and D. Cai (2019) Domain Adaptation for Semantic Segmentation with Maximum Squares Loss. In iccv, Cited by: §2.
  • [5] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A Simple Framework for Contrastive Learning of Visual Representations. In icml, Cited by: §1, §2.
  • [6] Y. Chen, W. Li, and L. V. Gool (2018) ROAD: Reality Oriented Adaptation for Semantic Segmentation of Urban Scenes. In cvpr, Cited by: §2.
  • [7] Y.-H. Chen, W.-Y. Chen, Y.-T. Chen, B.-C. Tsai, Y.-C. F. Wang, and M. Sun (2017) No More Discrimination: Cross City Adaptation of Road Scene Segmenters. In iccv, Cited by: §2.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes Dataset for Semantic Urban Scene Understanding. In cvpr, Cited by: §4.2.
  • [9] C. Doersch and A. Zisserman (2017) Multi-task Self-Supervised Visual Learning. In iccv, Cited by: §2.
  • [10] A. Dosovitskiy, J.T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative Unsupervised Feature Learning with Convolutional Neural Networks. In nips, Cited by: §2.
  • [11] T. Gupta, A. Vahdat, G. Chechik, X. Yang, J. Kautz, and D. Hoiem (2020) Contrastive Learning for Weakly Supervised Phrase Grounding. In eccv, Cited by: §2.
  • [12] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality Reduction by Learning an Invariant Mapping. In cvpr, Cited by: §2.
  • [13] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum Contrast for Unsupervised Visual Representation Learning. In cvpr, Cited by: §2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In cvpr, Cited by: §3.5.
  • [15] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) FCNs in the Wild: Pixel-level Adversarial and Constraint-based Adaptation. In arXiv:1612.02649, Cited by: §2.
  • [16] W. Hong, Z. Wang, M. Yang, and J. Yuan (2018) Conditional Generative Adversarial Network for Structured Domain Adaptation. In cvpr, Cited by: §2.
  • [17] J. Huang, S. Lu, D. Guan, and X. Zhang (2020) Contextual-Relation Consistent Domain Adaptation for Semantic Segmentation. In eccv, Cited by: §2.
  • [18] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) CyCADA: Cycle-Consistent Adversarial Domain Adaptation. In icml, Cited by: §2.
  • [19] X. Ji, J.F. Henriques, and A. Vedaldi (2019) Invariant Information Clustering for Unsupervised Image Classification and Segmentation. In iccv, Cited by: §2.
  • [20] T. Kalluri, G. Varma, M. Chandraker, and C. V. Jawahar (2019) Universal Semi-Supervised Semantic Segmentation. In iccv, Cited by: §1, §2, 4th item, Table 1, Table 2.
  • [21] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. External Links: 2004.11362 Cited by: §2.
  • [22] T. Kim and C. Kim (2020) Attract, Perturb, and Explore: Learning a Feature Alignment Network for Semi-supervised Domain Adaptation. In eccv, Cited by: §2.
  • [23] S. Lazebnik, C. Schmid, and J. Ponce (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 2169–2178. Cited by: §3.3.2.
  • [24] G. Li, G. Kang, W. Liu, Y. Wei, and Y. Yang (2020) Content-Consistent Matching for Domain Adaptive Semantic Segmentation. In eccv, Cited by: §2.
  • [25] Z. Li, Q. Tran, L. Mai, Z. Lin, and A. Yuille (2020) Context-Aware Group Captioning via Self-Attention and Contrastive Features. In cvpr, Cited by: §2.
  • [26] Q. Lian, F. Lv, L. Duan, and B. Gong (2019) Constructing Self-motivated Pyramid Curriculums for Cross-Domain Semantic Segmentation: A Non-Adversarial Approach. In iccv, Cited by: §2.
  • [27] H. Lin, P. Upchurch, and K. Bala (2019) Block annotation: better image annotation with sub-image decomposition. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5289–5299. Cited by: §4.4.
  • [28] F. Lv, T. Liang, X. Chen, and G. Lin (2020) Cross-Domain Semantic Segmentation via Domain-Invariant Interactive Relation Transfer. In cvpr, Cited by: §2.
  • [29] I. Misra and L. V. D. Maaten (2020) Self-Supervised Learning of Pretext-Invariant Representations. In cvpr, Cited by: §2.
  • [30] F. Pan, I. Shin, F. Rameau, S. Lee, and I. S. Kweon (2020) Unsupervised Intra-domain Adaptation for Semantic Segmentation through Self-Supervision. In cvpr, Cited by: §2.
  • [31] T. Park, A.A. Efros, R. Zhang, and J.-Y. Zhu (2020) Contrastive Learning for Unpaired Image-to-Image Translation. In eccv, Cited by: §2, §3.5.
  • [32] S. Paul, Y.-H. Tsai, S. Schulter, A. K. Roy-Chowdhury, and M. Chandraker (2020) Domain Adaptive Semantic Segmentation Using Weak Labels. In eccv, Cited by: §2, §2.
  • [33] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for Data: Ground Truth from Computer Games. In eccv, Cited by: §4.2.
  • [34] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In miccai, pp. 234–241. Cited by: §2.
  • [35] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez (2016) The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In cvpr, Cited by: §4.2.
  • [36] A. Rozantsev, M. Salzmann, and P. Fua (2019) Beyond Sharing Weights for Deep Domain Adaptation. pami 41 (4), pp. 801–814. Cited by: §2.
  • [37] K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko (2019) Semi-supervised Domain Adaptation via Minimax Entropy. In iccv, Cited by: §2.
  • [38] K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2018) Adversarial Dropout Regularization. In iclr, Cited by: §2.
  • [39] S. Sankaranarayanan, Y. Balaji, A. Jain, S. N. Lim, and R. Chellappa (2018) Learning from Synthetic Data: Addressing Domain Shift for Semantic Segmentation. In cvpr, Cited by: §2.
  • [40] M. N. Subhani and M. Ali (2020) Learning from Scale-Invariant Examples for Domain Adaptation in Semantic Segmentation. In eccv, Cited by: §2.
  • [41] Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive Multiview Coding. In eccv, Cited by: §2.
  • [42] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker (2018) Learning to Adapt Structured Output Space for Semantic Segmentation. In cvpr, Cited by: §2.
  • [43] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Perez (2019) ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation. In cvpr, Cited by: §1, §2, §3.3, §3.5, Figure 6, 2nd item, 3rd item, §4.2, §4.2, Table 1, Table 2.
  • [44] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Perez (2019) DADA: Depth-Aware Domain Adaptation in Semantic Segmentation. In iccv, Cited by: §2.
  • [45] Z. Wang, Y. Wei, R. Feris, J. Xiong, W.-M. Hwu, T. S. Huang, and H. Shi (2020) Alleviating Semantic-level Shift: A Semi-supervised Domain Adaptation Method for Semantic Segmentation. In cvprw, Cited by: §1, §2, §3.3, §3.5, §4.1, §4.2, §4.2, §4.3, §4.3, Table 1, Table 2.
  • [46] Z. Wu, X. Han, Y.-L. Lin, M. G. Uzunbas, T. Goldstein, S. N. Lim, and L. S. Davis (2018) DCAN: Dual Channel-wise Alignment Networks for Unsupervised Scene Adaptation. In eccv, Cited by: §2.
  • [47] Z. Wu, Y. Xiong, S.X. Yu, and D. Lin (2018) Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination. In cvpr, Cited by: §2.
  • [48] Y. Yang, D. Lao, G. Sundaramoorthi, and S. Soatto (2020) Phase Consistent Ecological Domain Adaptation. In cvpr, Cited by: §2, §4.2, §4.2.
  • [49] Y. Yang and S. Soatto (2020) FDA: Fourier Domain Adaptation for Semantic Segmentation. In cvpr, Cited by: §1, §2, §3.2, §3.5, §3.5, §3.5, §3.5, Figure 5, 1st item, §4.2, §4.2, §4.2, Table 1, Table 2.
  • [50] M. Ye, X. Zhang, P.C. Yuen, and S.-F. Chang (2019) Unsupervised Embedding Learning via Invariant and Spreading Instance Feature. In cvpr, Cited by: §2.
  • [51] X. Zhu, H. Zhou, C. Yang, J. Shi, and D. Lin (2018) Penalizing Top Performers: Conservative Loss for Semantic Segmentation Adaptation. In eccv, Cited by: §2.
  • [52] C. Zhuang, A.L. Zhai, and D. Yamins (2019) Local Aggregation for Unsupervised Learning of Visual Embeddings. In cvpr, Cited by: §2.
  • [53] Y. Zou, Z. Yu, B. V. Kumar, and J. Wang (2018) Domain Adaptation for Semantic Segmentation via Class-Balanced Self-Training. In eccv, Cited by: §2.