Contrastive Learning and Self-Training for Unsupervised Domain Adaptation in Semantic Segmentation

05/05/2021 ∙ by Robert A. Marsden, et al. ∙ University of Stuttgart

Deep convolutional neural networks have considerably improved state-of-the-art results for semantic segmentation. Nevertheless, even modern architectures lack the ability to generalize well to a test dataset that originates from a different domain. To avoid the costly annotation of training data for unseen domains, unsupervised domain adaptation (UDA) attempts to provide efficient knowledge transfer from a labeled source domain to an unlabeled target domain. Previous work has mainly focused on minimizing the discrepancy between the two domains by using adversarial training or self-training. While adversarial training may fail to align the correct semantic categories as it minimizes the discrepancy between the global distributions, self-training raises the question of how to provide reliable pseudo-labels. To align the correct semantic categories across domains, we propose a contrastive learning approach that adapts category-wise centroids across domains. Furthermore, we extend our method with self-training, where we use a memory-efficient temporal ensemble to generate consistent and reliable pseudo-labels. Although both contrastive learning and self-training (CLST) through temporal ensembling enable knowledge transfer between two domains, it is their combination that leads to a symbiotic structure. We validate our approach on two domain adaptation benchmarks: GTA5 → Cityscapes and SYNTHIA → Cityscapes. Our method achieves results that are better than or comparable to the state of the art. We will make the code publicly available.

1 Introduction

The goal in semantic image segmentation is to assign the correct class label to each pixel. This makes it suitable for complex image-based scene analysis that is required in applications like automated driving. However, in order to generate a labeled training dataset, pixel-level annotation must first be performed by humans. Since detailed manual labeling can take 90 minutes per image [8], it is associated with high costs. A potential workaround would be to generate the images and corresponding segmentation maps synthetically using computer game environments like Grand Theft Auto V (GTA5) [29]. However, even current segmentation models [2, 34, 1] do not generalize well to data from a different domain. In fact, their segmentation performance decreases drastically when there is a discrepancy between the training and test distribution as in a synthetic-to-real scenario.

The research field of unsupervised domain adaptation (UDA) studies how to transfer knowledge from a labeled source domain to an unlabeled target domain. The aim is to achieve the best possible results in the target domain, whereas the performance in the source domain is not considered. Current methods for UDA address the problem by minimizing the distribution discrepancy between the domains while performing supervised training on source data. The distribution alignment can take place in the pixel space [15, 23, 36, 44], feature space [16, 17, 41, 45], output space [38, 40, 39, 41, 24], or even in several spaces in parallel.

While adversarial training (AT) [10] is commonly used to minimize the distribution discrepancy between domains, it can fail to align the correct semantic categories. This is because adversarial training minimizes the mismatch between global distributions rather than class-specific ones, which can negatively affect the results [7]. This is also true for discrepancy measures such as the maximum mean discrepancy [31], which can be minimized without aligning the correct class distributions across domains [19]. To address the problem of misaligned classes across domains, we rely on contrastive learning (CL) [12]. The basic idea of CL is to encourage positive data pairs to be similar and negative data pairs to be apart. To perform domain adaptation and match the features of the correct semantic categories, positive pairs consist of features from the same class but different domains while negative pairs are from different classes and possibly from different domains [19, 27].

Due to the lack of labels in the target domain, the class of each target feature must be determined based on the predictions of the model. However, since different class prior probabilities can bias the final segmentation layer towards the source domain, we extend our approach with self-training (ST), i.e. using target predictions as pseudo-labels. We start from the observation that target pixels are predicted with high uncertainty [40]. Moreover, these uncertain predictions may also vary between different classes during training and are therefore usually not considered for self-training. By using a memory-efficient temporal ensemble that combines the predictions of a single network over time [21], we obtain the predictive tendency of the model. This allows us to create robust pseudo-labels even for the uncertain predictions that have high information content. The temporal ensemble has the additional advantage that pseudo-labels are updated directly during training, which reduces the computational complexity compared to a separate stage-wise recalculation [45, 26, 22, 49, 44].

We summarize our contributions as follows: First, we extend contrastive learning and self-training using a memory-efficient temporal ensemble to UDA for semantic segmentation. Second, we empirically show that both approaches are able to transfer knowledge between domains. Furthermore, we show that combining our contrastive learning and self-training (CLST) approach leads to a symbiotic setup that yields results competitive with the state of the art on GTA5 → Cityscapes and superior to it on SYNTHIA → Cityscapes.

2 Related Work

While the focus of UDA was initially on classification, interest in UDA for semantic segmentation has grown rapidly in the last few years. Since this work investigates UDA for semantic segmentation, the following literature review will mainly focus on this topic. In addition, we give a short review on self-ensembling and contrastive learning.

Adversarial Learning: AT can be applied to align the distributions of two domains in pixel space, feature space, and output space. In pixel space, the main idea is to transfer the appearance of one domain to the style of the other. Thus, it is assumed that the geometric structure in both domains is approximately the same and the difference lies mainly in texture and color. A very common approach uses a CycleGAN-based architecture [48] to transfer source images to the style of the target domain [15, 23, 36]. Since the transformation does not change the content, the source labels can be reused to train the model on target-like images in a supervised manner. Similarly, [7] uses adaptive instance normalization [18] in combination with a single Generative Adversarial Network [10] to augment source images into different target-like styles. Another option is to use adversarial training to minimize the discrepancy between feature distributions. In this case, the discriminator takes on the role of a domain classifier, which must decide whether the features belong to the source or target domain. For semantic segmentation, this approach was first proposed in [16] and is also used in [15, 36, 32]. In [6], this approach is extended by additionally giving each class its own domain classifier, which further matches the individual class distributions between domains. [38, 40, 39] use adversarial training to align the output spaces at the pixel level and patch level, respectively. Meanwhile, aligning the output distributions with adversarial training is also used in several publications as either a basic component or for warm-up [41, 46, 45].

Self-Training: In ST, the target predictions are converted into pseudo-labels, which are then used to minimize the cross-entropy. Since the quality of the pseudo-labels is crucial to this approach, [44] combines the predictions of three models and iteratively repeats the training of the models, followed by the recalculation of the pseudo-labels. Similarly, [46] improves their quality by averaging the predictions of two different outputs of the same network. Another commonly used strategy tries to convert only the correct target predictions into pseudo-labels. [32], for example, attempts to find them by combining the output of two discriminators with the confidence of a segmentation classifier. [47] does this by considering the pixel-wise uncertainty of the predictions, which is estimated during training. [45] converts predictions into pseudo-labels only if the target features are within a certain range of the nearest category-wise source feature mean. [50, 41, 22, 26] use the softmax output as a confidence measure and incorporate only predictions above a certain threshold into the training process. In doing so, they assume that a higher prediction probability is associated with higher accuracy. In order to make self-training less sensitive to incorrect pseudo-labels, [49] uses soft pseudo-labels and smoothed network predictions. Unlike the previously mentioned approaches, [40] and [3] do not explicitly create pseudo-labels but exploit entropy minimization and a maximum squares loss to conduct self-training. Again, both methods can further improve performance when only confident samples are considered.

Self-Ensembling: An ensemble considers the outputs of multiple independent models for the same input sample to arrive at a more reliable prediction. The basic idea is that different models make different errors, which can be compensated for in the majority. Ensembling can be employed in ST to create better pseudo-labels for the next training stages [44]. In semi-supervised learning, a special variant called self-ensembling has shown remarkable results [21, 35]. In this case, there is usually only one trainable model that minimizes an additional consistency loss between two different predictions of the same sample. While one prediction remains the output of the trainable network, current methods differ mainly in the creation of the second prediction. [21] uses an exponential moving average (EMA) to combine predictions generated at different times during training. This is also known as a temporal ensemble. [21] also proposes to generate the second prediction with the same model using a different dropout mask and augmentations. [35] extends this idea and proposes a Mean Teacher (MT) framework, where a second (non-trainable) model's weights are updated with an EMA over the actual trainable weights. While [9] extends the former idea to UDA for classification, [7, 42, 37] apply the MT framework to UDA for semantic segmentation.

Contrastive Learning: CL [12] is a framework that learns representations by contrasting positive pairs against negative pairs. It has recently dominated the field of self-supervised representation learning [14, 13, 4, 5]. Contrastive learning encourages representations of positive pairs (typically different augmentations from an image, preserving semantic information) to be similar and negative pairs (different image instances) to be apart. In the area of domain adaptation (DA) for classification [19, 27], CL has been applied to align the feature space of different domains. Positive pairs are generated by using samples of the same class, but different domains. Negative pairs are chosen so that they belong to different classes and potentially different domains. In [27], the contrastive loss is realized through minimizing a Frobenius norm. [19] modifies the maximum mean discrepancy [31] to correspond to a contrastive loss. Whereas existing ideas in DA compute the contrastive loss in the feature space, recent work [4, 11] shows that using a separate projection space for the contrastive loss is very beneficial in the setting of self-supervised representation learning.

3 Method

Definitions: In UDA, we have a set of source images $X_S$ with corresponding segmentation maps $Y_S$, as well as unlabeled target images $X_T$. The indices $s$ and $t$ denote the source and target domain, respectively. For simplicity, the image dimensions of both domains are described by $H \times W$. Furthermore, we explicitly divide the network into a feature extractor $F$ with parameters $\theta_F$ and a segmentation head $S$ with parameters $\theta_S$. The latter outputs a softmax probability map $p \in [0, 1]^{H \times W \times C}$, where $C$ is the total number of classes. The corresponding hard prediction $\hat{y}$ and the source segmentation maps $y_s$ have dimensions $H \times W \times C$ and are thus one-hot encoded. As shown in Fig. 2 a), we follow recent findings in self-supervised representation learning and integrate a projector $P$ with parameters $\theta_P$ into our network [4, 11]. Its task is to project the extracted feature maps into a projection space $\mathbb{R}^{H' \times W' \times D}$, where $D$ is usually smaller than the feature dimension $K$.

3.1 Self-Training

To train the network in a supervised manner, the segmentation maps of the source domain are used to compute a weighted pixel-wise cross-entropy (CE) loss

$$\mathcal{L}_{CE}^{s} = -\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=1}^{C} w_{s}^{(c)}\, y_{s}^{(h,w,c)} \log p_{s}^{(h,w,c)} \qquad (1)$$

where $w_{s}^{(c)}$ is a class balancing term that will be explained later. Although there is no label information available in the target domain, it is possible to minimize a second CE loss by converting the network predictions into pseudo-labels $\hat{y}_{t}$. The cross-entropy loss is then given by

$$\mathcal{L}_{CE}^{t} = -\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=1}^{C} w_{t}^{(c)}\, \hat{y}_{t}^{(h,w,c)} \log p_{t}^{(h,w,c)} \qquad (2)$$

where $w_{t}^{(c)}$ is again a weighting term. Minimizing both losses jointly will close the domain gap if reliable pseudo-labels can be provided. Although there may be many noisy predictions, especially in the early stages of the training, it is likely that some pixels will be correctly predicted by the model with some confidence. This is because not all pixels have the same transfer difficulty and some may be easier to transfer than others [50, 20, 22, 26]. To find these pixels during training, we use the entropy of the softmax values

$$E^{(h,w)} = -\sum_{c=1}^{C} p_{t}^{(h,w,c)} \log p_{t}^{(h,w,c)} \qquad (3)$$

Then, only the predictions that are among the most certain in terms of their entropy are converted into pseudo-labels, while all other predictions are ignored. This approach has already proven successful for MinEnt [40].
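As an illustration, a minimal PyTorch sketch of such an entropy-based selection is given below; the function name, the fraction of pixels kept, and the use of an ignore index are assumptions made for illustration, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def entropy_pseudo_labels(logits, keep_fraction=0.5, ignore_index=255):
    """Convert the most certain target predictions (lowest entropy) into hard
    pseudo-labels; all other pixels receive the ignore index (a sketch)."""
    probs = F.softmax(logits, dim=1)                      # (B, C, H, W)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(1)   # (B, H, W) pixel-wise entropy
    pseudo = probs.argmax(dim=1)                          # hard predictions

    # Entropy threshold such that `keep_fraction` of the pixels are kept.
    thresh = torch.quantile(entropy.flatten(), keep_fraction)
    pseudo[entropy > thresh] = ignore_index               # mask uncertain pixels
    return pseudo
```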

Although using the entropy as a guidance for selecting more reliable pseudo-labels works, their quality may remain moderate because there can be false predictions with high confidence. Therefore, we switch to the following strategy after a few epochs.

Temporal Ensemble: As mentioned earlier, most of the current self-training approaches try to find the correct target predictions by using confidence measures like the entropy or softmax probability. While this strategy can help to avoid including too many false predictions in the training, the information of uncertain but mostly correct predictions is also neglected. Since it is the uncertain predictions that contain a lot of information, we try to generate reliable pseudo-labels for them as well. This is achieved by considering the predictive tendency of the model for a given sample over time. More precisely, this tendency is extracted by using a temporal ensemble. It considers the numerous target predictions made by the model during training for a specific image. This leads to a smoothing of the noisy predictions. Note that due to the high uncertainty contained in target predictions [40], they may even vary between a few classes, especially at the beginning of the training.

For a given image $x_t$, the temporal ensemble is realized through a tensor $T$ that additively collects the target predictions $\hat{y}_{t}$ made at iteration $i$ (Eq. 4). As the quality of the predictions improves during training, $\hat{y}_{t}$ is multiplied by a stepwise increasing integer $\gamma$

$$T \leftarrow T + \gamma\, \hat{y}_{t} \qquad (4)$$

This ensures that more recent predictions have a larger impact on the final decision. To generate pseudo-labels from the tensor $T$, majority voting is used. It selects the class with the most votes as hard prediction $\hat{y}_{t}$. Although it would be possible to exclude pseudo-labels that have only a slight majority in the temporal ensemble, we do not apply any thresholds here and leave this open to further research.

The advantage of this realization is that the ensemble can be implemented in a memory-efficient way. First, depending on the number of epochs and $\gamma$, it allows the use of the uint8 format for $T$ (in our experiments, $\gamma$ did not exceed 4). Second, due to the one-hot encoding of $\hat{y}_{t}$, the ensemble can be realized as a sparse tensor. This drastically reduces memory requirements when the number of classes is large. For example, in one of our experiments, only a small fraction of the elements were non-zero.
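A minimal PyTorch sketch of such a vote tensor is given below for a single target image; the shapes, names, and update interface are assumptions made for illustration.

```python
import torch

# Vote bank for one target image: one uint8 counter per (class, pixel).
num_classes, H, W = 19, 512, 1024
votes = torch.zeros(num_classes, H, W, dtype=torch.uint8)

def update_ensemble(votes, hard_pred, gamma):
    """Additively collect the current one-hot prediction, weighted by the
    stepwise increasing integer gamma (cf. Eq. 4)."""
    one_hot = torch.nn.functional.one_hot(hard_pred, num_classes)  # (H, W, C)
    votes += (gamma * one_hot.permute(2, 0, 1)).to(torch.uint8)
    return votes

def ensemble_pseudo_label(votes):
    """Majority vote over the accumulated predictions."""
    return votes.argmax(dim=0)   # (H, W) hard pseudo-label
```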

Class Balancing: The motivation for the class balancing terms $w_s^{(c)}$ and $w_t^{(c)}$ in Eq. 1 and Eq. 2 arises from the possibility that there may be a discrepancy between the source and target class prior probabilities [43]. Such a difference can bias the final segmentation layer towards the source domain and thus negatively affect the segmentation of target samples [45]. To circumvent this problem, we use the source labels $y_s$ and the target pseudo-labels $\hat{y}_t$ to compute $w_s^{(c)}$ and $w_t^{(c)}$. Since pseudo-labels can better capture the presence of a class in an image than its exact number of pixels, the following strategy is used for both source and target domain. Instead of computing pixel-wise class prior probabilities, only the occurrence probability of each class in the dataset is considered. For the source domain, it is given by $o_s^{(c)} = N_s^{(c)} / N_s$, where $N_s^{(c)}$ is the number of images containing class $c$, while $N_s$ is the total number of images in the dataset [43]. To save computational resources, $o_s^{(c)}$ and $o_t^{(c)}$ are approximated online during training, starting from probability one for each class. The weighting terms for the source and target domain are then defined by

$$w_{s}^{(c)} = \frac{1}{\max\!\left(o_{s}^{(c)}, \tau\right)}, \qquad w_{t}^{(c)} = \frac{1}{\max\!\left(o_{t}^{(c)}, \tau\right)} \qquad (5)$$

where $\tau$ is a hyperparameter to avoid a too large weighting of very rare classes. The class balancing terms have the following effects. First, by using $w_t^{(c)}$, the model now focuses less on classes that occur frequently in the target dataset. This allows for better knowledge transfer for classes that are rare and perhaps more difficult to transfer [50]. Second, by additionally using $w_s^{(c)}$, the class prior probabilities of both domains are approximately aligned [43].
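The following sketch illustrates one way the occurrence statistics and weights could be maintained online; the momentum-based running estimate and all names are assumptions made for illustration.

```python
import torch

# Online class-occurrence statistics and capped inverse-frequency weights
# (cf. Eq. 5); `momentum` and the update rule are assumed, not from the paper.
num_classes, tau, momentum = 19, 0.1, 0.99
occurrence = torch.ones(num_classes)  # start from probability one per class

def update_class_weights(occurrence, label_map, ignore_index=255):
    """Update the per-class occurrence probability with the current image and
    return the capped inverse-frequency weights."""
    present = torch.zeros(num_classes)
    classes = label_map[label_map != ignore_index].unique()
    present[classes] = 1.0
    occurrence.mul_(momentum).add_((1 - momentum) * present)  # running estimate
    weights = 1.0 / occurrence.clamp(min=tau)                 # limit rare-class weight
    return occurrence, weights
```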

Figure 1: Left: before adaptation, right: after aligning category-wise centroids across domains using a contrastive loss.
Figure 2: a) Network architecture of our proposed approach, consisting of a feature extractor $F$, a segmentation head $S$, as well as a projector $P$. b) Procedure for calculating category-wise mean projections using a resized segmentation map.

3.2 Contrastive Learning

Similar to [45], our method is based on the observation that pixels of the same class cluster in the feature space. However, due to the discrepancy between the feature distributions, this observation only applies to features of the source domain and not across domains. This circumstance is also reflected on the left side of Figure 1. To cluster features of the same class across domains, we use a contrastive loss (right side of Fig. 1). Following [4, 11] and in contrast to previous CL approaches proposed for UDA for classification [19, 27], the contrastive loss is computed in the projection space, see Fig. 2 a).

First, class-wise mean projections $\mu_s^{(c)}$ are calculated using all source images in the current batch. For this, it is assumed that the source segmentation masks can be used to assign a class to each projection. Since projections of misclassified features can have an unfavorable effect on the corresponding class means, only the correctly segmented ones are considered. They can be extracted by comparing each prediction with its corresponding ground truth, as in [41]:

$$M_s^{(h,w)} = \mathbb{1}\!\left[\hat{y}_s^{(h,w)} = y_s^{(h,w)}\right] \qquad (6)$$

Note that the predictions and labels from Eq. 6 can have a different height and width than the source projections $z_s$. In this case, $\hat{y}_s$ and $y_s$ are first resized using nearest neighbor interpolation. Then, for each class $c$ contained in the current batch, class-wise mean projections $\mu_s^{(c)}$ are calculated. This is accomplished by averaging over the height $H'$ and width $W'$ of all source projection maps in the current batch

$$\mu_s^{(c)} = \frac{\sum_{n}\sum_{h,w} M_{s,n}^{(h,w)}\, y_{s,n}^{(h,w,c)}\, z_{s,n}^{(h,w)}}{\sum_{n}\sum_{h,w} M_{s,n}^{(h,w)}\, y_{s,n}^{(h,w,c)}} \qquad (7)$$

This approach is also illustrated in Fig. 2 b) for a batch consisting of only one source image. While Eq. 7 could be calculated using the entire source dataset rather than only the samples in the current batch, this would require propagating all source images through the network once. Furthermore, this procedure would have to be repeated several times, since the projections and thus their mean values will most likely change during training. To still extract a global source centroid for each class without much computational effort, an exponential moving average is used

$$\bar{\mu}_s^{(c)} \leftarrow m\, \bar{\mu}_s^{(c)} + (1 - m)\, \mu_s^{(c)} \qquad (8)$$

where $m$ is a momentum term. These global source centroids $\bar{\mu}_s^{(c)}$ are also shown in yellow in Fig. 1. For the target domain, Eq. 7 is slightly modified. First, $y_s$ is replaced by the (resized) pseudo-labels $\hat{y}_t$. Second, a separate class-wise mean projection $\mu_{t,n}^{(c)}$ is computed for each target image $n$ in the current batch. The reason is that each $\mu_{t,n}^{(c)}$ should roughly cluster around its respective source centroid $\bar{\mu}_s^{(c)}$. Finally, the contrastive loss is given by

$$\mathcal{L}_{CL} = -\sum_{n}\sum_{c=1}^{C} w_t^{(c)}\, \log \frac{\exp\!\left(\mathrm{sim}\!\left(\mu_{t,n}^{(c)}, \bar{\mu}_s^{(c)}\right)\right)}{\sum_{c'=1}^{C} \mathbb{1}_{[c' \neq c]}\, \exp\!\left(\mathrm{sim}\!\left(\mu_{t,n}^{(c)}, \bar{\mu}_s^{(c')}\right)\right)} \qquad (9)$$

where $\mathbb{1}_{[c' \neq c]}$ is the indicator function evaluating to 1 iff $c' \neq c$ and $\mathrm{sim}(\cdot,\cdot)$ denotes a cosine similarity, which is the dot product between two $\ell_2$-normalized vectors. Note that a similar loss was already used in [14, 4, 33]. It can be shown that $\mathcal{L}_{CL}$ becomes minimal if the cosine similarity becomes maximal for projections of the same class (positive pairs) and minimal for different classes (negative pairs). Thus, the intra-class variance across domains is minimized while the inter-class variance is maximized. This is also denoted by the arrows in Fig. 1. However, the maximization of the inter-class variance is only implicit because it is the target mean representations which force the global source centroids to be apart. Since some classes may not always be present in a batch of samples, the contrastive loss is weighted by $w_t^{(c)}$ (Eq. 5). In this case, it encourages the model to focus more on aligning and separating rare classes when they appear in the batch. Finally, note that although a contrastive loss is used, only a small number of distances (at most one per class pair) needs to be calculated for each target image. This makes the approach suitable for models with large feature maps in terms of height and width.
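To make the procedure more concrete, the sketch below computes class-wise mean projections, updates the global source centroids with an EMA, and evaluates a cosine-similarity contrastive loss between per-image target means and the centroids. It follows an InfoNCE-style formulation whose normalization may differ from Eq. 9, and all names, shapes, and the cross-entropy realization are assumptions.

```python
import torch
import torch.nn.functional as F

def class_mean_projections(proj, labels, num_classes):
    """Average the projection vectors of each class present in `labels`
    (cf. Eq. 7); `labels` is assumed to be resized to the projection size."""
    D = proj.shape[0]                                   # proj: (D, H', W'), labels: (H', W')
    flat_p, flat_l = proj.reshape(D, -1), labels.reshape(-1)
    means, present = [], []
    for c in range(num_classes):
        mask = flat_l == c
        if mask.any():
            means.append(flat_p[:, mask].mean(dim=1))
            present.append(c)
    return torch.stack(means), torch.tensor(present)    # (K, D), (K,)

def update_centroids(centroids, batch_means, classes, m=0.9):
    """EMA update of the global source centroids (cf. Eq. 8);
    `centroids` is a (C, D) buffer that carries no gradient."""
    centroids[classes] = m * centroids[classes] + (1 - m) * batch_means
    return centroids

def contrastive_loss(target_means, classes, centroids, class_weights):
    """Cosine-similarity loss pulling each per-image target class mean towards
    its global source centroid and away from the others (InfoNCE-style)."""
    t = F.normalize(target_means, dim=1)                # (K, D)
    s = F.normalize(centroids, dim=1)                   # (C, D)
    sim = t @ s.t()                                     # (K, C) cosine similarities
    loss = F.cross_entropy(sim, classes, reduction="none")
    return (class_weights[classes] * loss).mean()
```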

By combining self-training through temporal ensembling and contrastive learning, we create a symbiotic framework in which both methods can directly benefit from each other. ST mitigates the bias of the last segmentation layer towards the source domain, improving the target predictions. Needless to say, both ST and CL profit from better pseudo-labels. Additionally, for CL, better pseudo-labels improve the quality of the target class means $\mu_{t,n}^{(c)}$. This allows the semantic categories to be clustered more accurately across domains, resulting again in better target predictions.

To sum up, the following loss function is minimized

$$\mathcal{L} = \mathcal{L}_{CE}^{s} + \lambda_{ST}\, \mathcal{L}_{CE}^{t} + \lambda_{CL}\, \mathcal{L}_{CL} \qquad (10)$$

where $\lambda_{ST}$ and $\lambda_{CL}$ are two hyperparameters balancing the influence of self-training and contrastive learning. Our overall training procedure is shown in Algorithm 1. The gradients of the individual losses are backpropagated to the corresponding network parameters $\theta_F$, $\theta_S$, and $\theta_P$.

4 Experiments

4.1 Experimental Settings

Network Architecture: Similar to [40, 38, 46, 41, 23, 20, 22, 3], we deploy the DeepLab-V2 [1] framework with a ResNet-101 as feature extractor $F$. As is common practice, we initialize $F$ with weights pre-trained on ImageNet and freeze all batch normalization layers. The final segmentation head $S$ consists of an Atrous Spatial Pyramid Pooling (ASPP) head. We fix the dilation rates of the ASPP head to {6, 12, 18, 24}, as was done in previous work. To create a symmetrical network architecture, we use a similar ASPP head for the projector $P$ as described before. The only difference is that the projector outputs a tensor with $D$ channels and does not use any non-linearity.
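As an illustration of what such a linear ASPP projector could look like, consider the sketch below; the parallel dilated branches are summed as in the DeepLab-V2 ASPP head, while the channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ASPPProjector(nn.Module):
    """Linear ASPP-style projector: parallel dilated 3x3 convolutions whose
    outputs are summed, with no non-linearity (a sketch; channel sizes assumed)."""
    def __init__(self, in_channels=2048, out_channels=256,
                 dilations=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      padding=d, dilation=d, bias=True)
            for d in dilations
        ])

    def forward(self, x):
        # Sum the parallel dilated branches, as in the DeepLab-V2 ASPP head.
        return sum(branch(x) for branch in self.branches)
```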

1: Iterations $N$, iterations for warm-up $N_{warm}$, use ensemble pseudo-labels after $N_{ens}$ iterations
2: Initialize $\theta_S$ and $\theta_P$ randomly
3: Initialize $\theta_F$ with ImageNet pre-trained weights
4: Initialize ensemble tensor $T$ with zeros
5: for $i = 1$ to $N$ do
6:     Sample minibatch $x_s$, $x_t$ from $X_S$ and $X_T$
7:     Update source weighting $w_s$ using Eq. 5
8:     Calculate $\mathcal{L}_{CE}^{s}$ using Eq. 1
9:     Calculate class-wise mean projections $\mu_s^{(c)}$ using Eq. 7
10:     Update $m$ using a cosine decay
11:     Update global source centroids $\bar{\mu}_s^{(c)}$ using Eq. 8
12:     Get target softmax probability maps $p_t$
13:     if $i \geq N_{warm}$ then
14:         Update ensemble $T$ using Eq. 4
15:     end if
16:     if $i \geq N_{ens}$ then
17:         Create pseudo-labels $\hat{y}_t$ from ensemble $T$
18:     else
19:         Generate pseudo-labels $\hat{y}_t$ for the most certain predictions
20:     end if
21:     Update target weighting $w_t$ using Eq. 5
22:     if $i \geq N_{warm}$ then
23:         Calculate $\mathcal{L}_{CE}^{t}$ and $\mathcal{L}_{CL}$ using Eq. 2 and 9
24:     end if
25:     Update parameters $\theta_F$, $\theta_S$, $\theta_P$
26: end for
Algorithm 1 CLST Training Procedure

Implementation Details: Our implementation adopts the PyTorch deep learning framework [28]. Most of the hyperparameters are taken directly from the base architecture from [38]. We train the network using SGD with Nesterov acceleration to speed up convergence and deploy a polynomial learning rate decay. The momentum and weight decay of the optimizer are kept fixed. Furthermore, we use the same value for both $\lambda_{ST}$ and $\lambda_{CL}$. For the momentum term $m$ in Eq. 8 we use a cosine decay. For training and testing, we rescale all source images and all target images to fixed sizes following [38]. In addition, we also investigate the impact of color jittering and Gaussian blur, which was recently used in [37]. We train our model using batches with 2 source images and 2 target images. The model is pre-trained on the source domain for a fixed number of iterations before we switch to the loss function from Eq. 10. The pseudo-labels from the temporal ensemble are used after a further fixed number of iterations. $\gamma$ is initially set to 1 and incremented by 1 at regular intervals. All iteration-related parameters were chosen to be approximately multiples of the number of target images and received no tuning. We set $\tau$ such that the weighting of classes that occur in only a small fraction of the images in the dataset is limited. This was set by inspecting the class occurrence probabilities in the GTA5 dataset and was also not tuned.
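The optimizer and schedule described above could be set up as in the following sketch; the concrete learning rate, momentum, weight decay, and iteration count are placeholders rather than the values used in the paper.

```python
import torch
import torch.nn as nn

# Placeholder model and optimizer; the specific hyperparameter values below
# are illustrative only.
model = nn.Conv2d(3, 19, kernel_size=1)
optimizer = torch.optim.SGD(model.parameters(), lr=2.5e-4, momentum=0.9,
                            weight_decay=5e-4, nesterov=True)

def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Polynomial learning rate decay commonly used for DeepLab-style training."""
    return base_lr * (1.0 - iteration / max_iter) ** power

def set_lr(optimizer, lr):
    for group in optimizer.param_groups:
        group["lr"] = lr

# Per-iteration usage inside the training loop:
# set_lr(optimizer, poly_lr(2.5e-4, it, max_iter))
```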

Datasets and Metric: We evaluate our approach in the challenging synthetic-to-real scenario, where Cityscapes [8] is used as the real-world target domain. Cityscapes contains 2975 training and 500 validation images. One of the synthetic source datasets is GTA5 [29], which contains 24966 synthesized frames of the well-known Grand Theft Auto V video game. We evaluate GTA5 → Cityscapes using the common 19 classes. The second synthetic source dataset is SYNTHIA [30], which has 9400 images in total and only shares 16 classes with Cityscapes. Following [40, 50, 20, 37, 26], we evaluate this transfer with respect to all 16 classes and for a subset consisting of 13 classes. We train our model using all source images and evaluate it on the Cityscapes validation set. As a metric, we use the widely adopted mean intersection-over-union (mIoU).

Figure 3: UMAP feature space visualization of class-wise mean representations for Cityscapes validation set. a) After training on source data only. b) Using our contrastive and self-training approach (CLST). c) CLST with global source centers shown in black.

4.2 Ablation Studies

We begin by examining the impact of contrastive learning and self-training on the final results by setting either $\lambda_{CL}$ or $\lambda_{ST}$ to zero. Table 1 a) shows the best mIoU for GTA5 → Cityscapes. In addition, it includes the results of the adversarial approach AdaptSegNet [38] that uses two discriminators to align the output distributions of the segmentation model at different depths. As can be seen, both CL and ST outperform the adversarial baseline and enable knowledge transfer between domains. When CL is combined with ST (CLST), the result improves by 1.7 mIoU compared to plain ST and we reach 49.1 mIoU. This illustrates that CL and ST can indeed benefit from each other.

a) GTA5 → Cityscapes
Method mIoU
AdaptSegNet [38] 42.4
CL 43.7
ST 47.4
CLST 49.1
CLST + Aug 50.6
b) SYNTHIA → Cityscapes
Method mIoU mIoU*
AdaptSegNet [38] - 46.7
CL 40.6 46.3
ST 44.9 51.7
CLST 46.2 53.5
CLST + Aug 48.5 56.3
Table 1: Component analysis for a) GTA5 → Cityscapes and b) SYNTHIA → Cityscapes. In the latter case, the results are shown with respect to all 16 classes (mIoU) and to only 13 classes (mIoU*).

To test why the contribution of contrastive learning is much lower than that of self-training, we examined the influence of the two class balancing terms $w_s^{(c)}$ and $w_t^{(c)}$. We observed a drop in performance when only one of the two terms was used, and an even larger drop when neither $w_s^{(c)}$ nor $w_t^{(c)}$ was applied. Since contrastive learning cannot mitigate harmful biases in the final segmentation layer towards the source domain, or easy-to-transfer classes, the quality of the pseudo-labels remains moderate. As a result, the target means include more false predictions, causing a worse alignment. Finally, we added color jittering and Gaussian blur, which increased the performance further.

Similar trends can be observed for SYNTHIA → Cityscapes, where the results with respect to all 16 classes and to only 13 classes (mIoU*) are presented in Table 1 b). Again, the contribution of self-training is crucial, but the results can be further improved by adding the contrastive component $\mathcal{L}_{CL}$. In this case, the effect of the augmentations is even larger and a further increase in mIoU can be observed.

To investigate the influence of our ASPP projector, we calculated the contrastive loss directly on the feature space. As it turned out, the results without the projector were noticeably worse for GTA5 → Cityscapes. We also experimented with different non-linear projectors having different numbers of layers and kernels. We found that none of them performed better than our linear ASPP projector, which is symmetrical to our segmentation head and therefore may calculate similar local and global features.

4.3 Feature Space Visualization

To visualize the quality of our learned representations, we compute for each target image in the Cityscapes validation set its own category-wise feature mean and visualize all of them using UMAP [25]. This allows us to make a direct comparison with a model trained on source data only. The results are illustrated in Fig. 3. If a model is trained on source data only (a), the clusters are close to each other and overlap. As can be seen in (b), our CLST algorithm forms much more separable clusters. In (c), the global source centroids are additionally shown in black. Both source and target domain are in similar regions of the feature space, suggesting successful transfer. Note that these experiments were conducted without any data augmentation.
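For reference, a minimal sketch of this visualization step with the umap-learn package is shown below; the collected feature means and class labels are random placeholders.

```python
import numpy as np
import umap  # pip install umap-learn

# One category-wise feature mean per validation image (placeholders here).
feature_means = np.random.randn(500 * 19, 2048).astype(np.float32)
class_ids = np.random.randint(0, 19, size=feature_means.shape[0])

# Embed the high-dimensional means into 2D for plotting (cf. Fig. 3).
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(feature_means)

# e.g. scatter-plot `embedding` colored by `class_ids` with matplotlib.
```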

4.4 Comparison with State-of-the-Art Methods

In Table 2, we compare our method with the current state of the art. We mainly show results that also use DeepLab-V2 with a ResNet-101 backbone. The only exception is CAG [45], which uses the more powerful DeepLabv3+ [2]. However, since this method minimizes a norm-based distance between source and target features and also uses self-training, it was included in the comparison. Furthermore, we cite the results of other approaches directly from the corresponding papers. It is worth mentioning that the cited results may include several training stages [45, 41, 46, 23, 22, 26] or transferred source images [41, 23].

GTA5 → Cityscapes: As shown in Table 2 a), by using similar augmentations to DACS [37], our method (CLST + Aug) beats most state-of-the-art methods without exiting the training loop to recompute pseudo-labels. While this makes our method significantly less computationally expensive, it comes with slightly worse performance. This is due to the batch normalization layers, which in our experiments give better results in validation mode than in the training mode that is used for the temporal ensemble. If performance is most important, the results can be further improved by creating new pseudo-labels and then fine-tuning (FT) the model on the target domain, minimizing only $\mathcal{L}_{CE}^{t}$ from Eq. 2. This approach was also used by IAST and improves our results by 1 mIoU for a model that was trained with augmentation (CLST + Aug + FT) and without augmentation (CLST + FT) in the previous stage. Note that the pseudo-labels were created without applying a certainty-based threshold. Although other self-training based methods like CAG or IAST may yield competitive results, they rely on an adversarial warm-up. For example, CAG reports a drop of 6.3 mIoU when the model was not pre-trained with AT. Our approach, on the other hand, does not require such a warm-up, but may also benefit from it.

a) GTA5 → Cityscapes
Method road sidewalk building wall fence pole light sign veg terrain sky person rider car truck bus train motor bike mIoU
AdaptSeg [38] 86.5 36.0 79.9 23.4 23.3 23.9 35.2 14.8 83.4 33.3 75.6 58.5 27.6 73.7 32.5 35.4 3.9 30.1 28.1 42.4
ADVENT [40] 89.4 33.1 81.0 26.6 26.8 27.2 33.5 24.7 83.9 36.7 78.8 58.7 30.5 84.8 38.5 44.5 1.7 31.6 32.4 45.5
CBST [50] 91.8 53.5 80.5 32.7 21.0 34.0 28.9 20.4 83.9 34.2 80.9 53.1 24.0 82.7 30.3 35.9 16.0 25.9 42.8 45.9
MaxSquare [3] 89.4 43.0 82.1 30.5 21.3 30.3 34.7 24.0 85.3 39.4 78.2 63.0 22.9 84.6 36.4 43.0 5.5 34.7 33.5 46.4
PatchAlign [39] 92.3 51.9 82.1 29.2 25.1 24.5 33.8 33.0 82.4 32.8 82.2 58.6 27.2 84.3 33.4 46.3 2.2 29.5 32.3 46.5
PLCA [20] 84.0 30.4 82.4 35.3 24.8 32.2 36.8 24.5 85.5 37.2 78.6 66.9 32.8 85.5 40.4 48.0 8.8 29.8 41.8 47.7
MRNet [46] 90.5 35.0 84.6 34.3 24.0 36.8 44.1 42.7 84.5 33.6 82.5 63.1 34.4 85.8 32.9 38.2 2.0 27.1 41.8 48.3
BDL [23] 91.0 44.7 84.2 34.6 27.6 30.2 36.0 36.0 85.0 43.6 83.0 58.6 31.6 83.3 35.3 49.7 3.3 28.8 35.6 48.5
SIM [41] 90.1 44.7 84.8 34.3 28.7 31.6 35.0 37.6 84.7 43.3 85.3 57.0 31.5 83.8 42.6 48.5 1.9 30.4 39.0 49.2
CCM [22] 93.5 57.6 84.6 39.3 24.1 25.2 35.0 17.3 85.0 40.6 86.5 58.7 28.7 85.8 49.0 56.4 5.4 31.9 43.2 49.9
CAG [45] 90.4 51.6 83.8 34.2 27.8 38.4 25.3 48.4 85.4 38.2 78.1 58.6 34.6 84.7 21.9 42.7 41.1 29.3 37.2 50.2
IAST [26] 93.8 57.8 85.1 39.5 26.7 26.2 43.1 34.7 84.9 32.9 88.0 62.6 29.0 87.3 39.2 49.6 23.2 34.7 39.6 51.5
DACS [37] 89.9 39.6 87.8 30.7 39.5 38.5 46.4 52.7 87.9 43.9 88.7 67.2 35.7 84.4 45.7 50.1 0.0 27.2 33.9 52.1
CLST 90.5 42.6 83.8 35.0 26.5 24.5 40.8 35.5 84.7 37.2 81.8 63.2 36.4 85.4 41.1 51.7 0.1 25.2 47.4 49.1
CLST + FT 91.2 44.5 84.4 35.9 27.4 24.0 41.2 37.3 85.3 39.7 83.1 63.7 37.6 85.9 43.2 50.8 0.1 27.7 49.7 50.1
CLST + Aug 92.6 52.8 85.6 35.3 27.4 28.4 42.5 37.0 84.4 36.2 87.8 63.2 35.9 86.2 43.6 49.7 0.4 25.1 48.5 50.6
CLST + Aug + FT 92.8 53.5 86.1 39.1 28.1 28.9 43.6 39.4 84.6 35.7 88.1 63.9 38.3 86.0 41.6 50.6 0.1 30.4 51.7 51.6
b) SYNTHIA → Cityscapes
Method road sidewalk building wall* fence* pole* light sign veg sky person rider car bus motor bike mIoU mIoU*
AdaptSeg [38] 84.3 42.7 77.5 - - - 4.7 7.0 77.9 82.5 54.3 21.0 72.3 32.2 18.9 32.3 - 46.7
ADVENT [40] 85.6 42.2 79.7 - - - 5.4 8.1 80.4 84.1 57.9 23.8 73.3 36.4 14.2 33.0 - 48.0
CBST [50] 68.0 29.9 76.3 10.8 1.4 33.9 22.8 29.5 77.6 78.3 60.6 28.3 81.6 23.5 18.8 39.8 42.6 48.9
MaxSquare [3] 82.9 40.7 80.3 10.2 0.8 25.8 12.8 18.2 82.5 82.2 53.1 18.0 79.0 31.4 10.4 35.6 41.4 48.2
PatchAlign [39] 82.4 38.0 78.6 8.7 0.6 26.0 3.9 11.1 75.5 84.6 53.5 21.6 71.4 32.6 19.3 31.7 40.0 46.5
PLCA [20] 82.6 29.0 81.0 11.2 0.2 33.6 24.9 18.3 82.8 82.3 62.1 26.5 85.6 48.9 26.8 52.2 46.8 54.0
MRNet [46] 83.1 38.2 81.7 9.3 1.0 35.1 30.3 19.9 82.0 80.1 62.8 21.1 84.4 37.8 24.5 53.3 46.5 53.8
BDL [23] 86.0 46.7 80.3 - - - 14.1 11.6 79.2 81.3 54.1 27.9 73.7 42.2 25.7 45.3 - 51.4
SIM [41] 83.0 44.0 80.3 - - - 17.1 15.8 80.5 81.8 59.9 33.1 70.2 37.3 28.5 45.8 - 52.1
CCM [22] 79.6 36.4 80.6 13.3 0.3 25.5 22.4 14.9 81.8 77.4 56.8 25.9 80.7 45.3 29.9 52.0 45.2 52.9
CAG [45] 84.7 40.8 81.7 7.8 0.0 35.1 13.3 22.7 84.5 77.6 64.2 27.8 80.9 19.7 22.7 48.3 44.5 -
IAST [26] 81.9 41.5 83.3 17.7 4.6 32.3 30.9 28.8 83.4 85.0 65.5 30.8 86.5 38.2 33.1 52.7 49.8 57.0
DACS [37] 80.5 25.1 81.9 21.4 2.8 37.2 22.6 23.9 83.6 90.7 67.6 38.3 82.9 38.9 28.4 47.5 48.3 54.8
CLST 79.0 36.7 81.3 12.7 0.3 30.4 26.3 21.5 83.7 86.2 56.4 21.1 84.6 44.8 20.7 53.4 46.2 53.5
CLST + FT 81.1 39.1 81.9 15.1 0.5 30.6 27.5 23.3 84.5 86.3 58.0 23.6 85.7 48.1 25.1 55.5 47.8 55.3
CLST + Aug 85.9 45.7 81.9 13.8 0.3 29.6 30.5 24.2 83.6 87.9 57.7 25.3 85.2 46.9 22.3 54.8 48.5 56.3
CLST + Aug + FT 88.0 49.2 82.2 16.3 0.4 29.2 31.8 23.9 84.1 88.0 59.1 27.2 85.5 46.6 28.9 56.5 49.8 57.8
Table 2: Comparison to state-of-the-art results on the Cityscapes validation set for task a) GTA5 → Cityscapes and b) SYNTHIA → Cityscapes. In the latter case, we report the mIoU with respect to all 16 classes and to only 13 classes (mIoU*), the latter excluding all classes marked with "*".

SYNTHIA → Cityscapes: Similar results can be observed in Table 2 b), where we show the mIoU with respect to all 16 classes and to only 13 classes (mIoU*), excluding the classes marked with "*". Again, CLST + Aug outperforms most of the other methods, including DACS. If we apply the same strategy as explained before and fine-tune our model on the target domain, we observe an increase of 1.3 mIoU for a model that was trained with augmentation in the first stage. When trained without augmentation, the increase was 1.6 mIoU. Overall, we achieve equivalent results to IAST for 16 classes and better results for 13 classes.

5 Conclusion

In this work, we proposed a symbiotic setup for UDA for semantic segmentation. It combines recent ideas in contrastive learning, using a separate projection space, with self-training. For reliable and consistent pseudo-labels over time, a memory-efficient temporal ensemble is used. Each individual component already contributes to a knowledge transfer between domains. It is their combination that yields results better than or equivalent to the state of the art on two common synthetic-to-real benchmarks: GTA5 → Cityscapes and SYNTHIA → Cityscapes.

Acknowledgments

This publication was created as part of the research project ”KI Delta Learning” (project number: 19A19013R) funded by the Federal Ministry for Economic Affairs and Energy (BMWi) on the basis of a decision by the German Bundestag.

References

  • [1] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §1, §4.1.
  • [2] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §1, §4.4.
  • [3] M. Chen, H. Xue, and D. Cai (2019) Domain adaptation for semantic segmentation with maximum squares loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2090–2099. Cited by: §2, §4.1, §4.4, Table 2.
  • [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §2, §3.2, §3.2, §3.
  • [5] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton (2020) Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029. Cited by: §2.
  • [6] Y. Chen, W. Chen, Y. Chen, B. Tsai, Y. Frank Wang, and M. Sun (2017) No more discrimination: cross city adaptation of road scene segmenters. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1992–2001. Cited by: §2.
  • [7] J. Choi, T. Kim, and C. Kim (2019) Self-ensembling with gan-based data augmentation for domain adaptation in semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pp. 6830–6840. Cited by: §1, §2, §2.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §1, §4.1.
  • [9] G. French, M. Mackiewicz, and M. Fisher (2017) Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208. Cited by: §2.
  • [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27, pp. 2672–2680. Cited by: §1, §2.
  • [11] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §2, §3.2, §3.
  • [12] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §1, §2.
  • [13] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §2.
  • [14] O. Henaff (2020) Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192. Cited by: §2, §3.2.
  • [15] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) Cycada: cycle-consistent adversarial domain adaptation. In International conference on machine learning, pp. 1989–1998. Cited by: §1, §2.
  • [16] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649. Cited by: §1, §2.
  • [17] H. Huang, Q. Huang, and P. Krahenbuhl (2018) Domain transfer through deep activation matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 590–605. Cited by: §1.
  • [18] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510. Cited by: §2.
  • [19] G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann (2019) Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4893–4902. Cited by: §1, §2, §3.2.
  • [20] G. Kang, Y. Wei, Y. Yang, Y. Zhuang, and A. G. Hauptmann (2020) Pixel-level cycle association: a new perspective for domain adaptive semantic segmentation. arXiv preprint arXiv:2011.00147. Cited by: §3.1, §4.1, §4.1, §4.4, Table 2.
  • [21] S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: §1, §2.
  • [22] G. Li, G. Kang, W. Liu, Y. Wei, and Y. Yang (2020) Content-consistent matching for domain adaptive semantic segmentation. In European Conference on Computer Vision, pp. 440–456. Cited by: §1, §2, §3.1, §4.1, §4.4, §4.4, Table 2.
  • [23] Y. Li, L. Yuan, and N. Vasconcelos (2019) Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6936–6945. Cited by: §1, §2, §4.1, §4.4, §4.4, Table 2.
  • [24] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang (2019) Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2507–2516. Cited by: §1.
  • [25] L. McInnes, J. Healy, and J. Melville (2018) Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. Cited by: §4.3.
  • [26] K. Mei, C. Zhu, J. Zou, and S. Zhang (2020) Instance adaptive self-training for unsupervised domain adaptation. arXiv preprint arXiv:2008.12197. Cited by: §1, §2, §3.1, §4.1, §4.4, §4.4, Table 2.
  • [27] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto (2017) Unified deep supervised domain adaptation and generalization. In Proceedings of the IEEE international conference on computer vision, pp. 5715–5725. Cited by: §1, §2, §3.2.
  • [28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.
  • [29] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In European conference on computer vision, pp. 102–118. Cited by: §1, §4.1.
  • [30] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez (2016) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3234–3243. Cited by: §4.1.
  • [31] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu (2013) Equivalence of distance-based and rkhs-based statistics in hypothesis testing. The Annals of Statistics, pp. 2263–2291. Cited by: §1, §2.
  • [32] T. Shen, D. Gong, W. Zhang, C. Shen, and T. Mei (2019) Regularizing proxies with multi-adversarial training for unsupervised domain-adaptive semantic segmentation. arXiv preprint arXiv:1907.12282. Cited by: §2, §2.
  • [33] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 1857–1865. Cited by: §3.2.
  • [34] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang (2019) High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514. Cited by: §1.
  • [35] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780. Cited by: §2.
  • [36] M. Toldo, U. Michieli, G. Agresti, and P. Zanuttigh (2020) Unsupervised domain adaptation for mobile semantic segmentation based on cycle consistency and feature alignment. Image and Vision Computing 95, pp. 103889. Cited by: §1, §2.
  • [37] W. Tranheden, V. Olsson, J. Pinto, and L. Svensson (2021) DACS: domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1379–1389. Cited by: §2, §4.1, §4.1, §4.4, §4.4, Table 2.
  • [38] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7472–7481. Cited by: §1, §2, §4.1, §4.1, §4.2, §4.2, §4.4, Table 1, Table 2.
  • [39] Y. Tsai, K. Sohn, S. Schulter, and M. Chandraker (2019) Domain adaptation for structured output via discriminative patch representations. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1456–1465. Cited by: §1, §2, §4.4, Table 2.
  • [40] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2517–2526. Cited by: §1, §1, §2, §2, §3.1, §3.1, §4.1, §4.1, §4.4, Table 2.
  • [41] Z. Wang, M. Yu, Y. Wei, R. Feris, J. Xiong, W. Hwu, T. S. Huang, and H. Shi (2020) Differential treatment for stuff and things: a simple unsupervised domain adaptation method for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12635–12644. Cited by: §1, §2, §2, §3.2, §4.1, §4.4, §4.4, Table 2.
  • [42] Y. Xu, B. Du, L. Zhang, Q. Zhang, G. Wang, and L. Zhang (2019) Self-ensembling attention networks: addressing domain shift for semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5581–5588. Cited by: §2.
  • [43] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo (2017) Mind the class weight bias: weighted maximum mean discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2272–2281. Cited by: §3.1.
  • [44] Y. Yang and S. Soatto (2020) Fda: fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4085–4095. Cited by: §1, §1, §2, §2.
  • [45] Q. Zhang, J. Zhang, W. Liu, and D. Tao (2019) Category anchor-guided unsupervised domain adaptation for semantic segmentation. In Advances in Neural Information Processing Systems, pp. 435–445. Cited by: §1, §1, §2, §2, §3.1, §3.2, §4.4, §4.4, Table 2.
  • [46] Z. Zheng and Y. Yang (2019) Unsupervised scene adaptation with memory regularization in vivo. arXiv preprint arXiv:1912.11164. Cited by: §2, §2, §4.1, §4.4, §4.4, Table 2.
  • [47] Z. Zheng and Y. Yang (2021) Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. International Journal of Computer Vision, pp. 1–15. Cited by: §2.
  • [48] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §2.
  • [49] Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang (2019) Confidence regularized self-training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5982–5991. Cited by: §1, §2.
  • [50] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pp. 289–305. Cited by: §2, §3.1, §3.1, §4.1, §4.4, Table 2.