
ML-BPM: Multi-teacher Learning with Bidirectional Photometric Mixing for Open Compound Domain Adaptation in Semantic Segmentation

Open compound domain adaptation (OCDA) considers the target domain as a compound of multiple unknown homogeneous subdomains. The goal of OCDA is to minimize the domain gap between the labeled source domain and the unlabeled compound target domain, which benefits the model's generalization to unseen domains. Current OCDA methods for semantic segmentation adopt manual domain separation and employ a single model to simultaneously adapt to all the target subdomains. However, adapting to one target subdomain might hinder the model from adapting to other dissimilar target subdomains, which leads to limited performance. In this work, we introduce a multi-teacher framework with bidirectional photometric mixing to adapt to every target subdomain separately. First, we present an automatic domain separation to find the optimal number of subdomains. On this basis, we propose a multi-teacher framework in which each teacher model uses bidirectional photometric mixing to adapt to one target subdomain. Furthermore, we conduct an adaptive distillation to learn a student model and apply consistency regularization to improve the student's generalization. Experimental results on benchmark datasets show the efficacy of the proposed approach on both the compound domain and the open domains against existing state-of-the-art approaches.


1 Introduction

Semantic segmentation is a fundamental task with applications to many problems, including robotics [34], autonomous driving [35], and medical diagnosis [16]. Recently, deep learning-based semantic segmentation approaches [12, 36, 35] have achieved remarkable progress. However, their effectiveness and generalization ability require a large amount of pixel-wise annotated data, which is expensive to collect. To reduce the cost of data collection and annotation, numerous synthetic datasets have been proposed [21, 22]. However, models trained on synthetic data tend to generalize poorly to real images. To cope with this issue, unsupervised domain adaptation (UDA) methods [26, 28, 37, 17, 25, 33] have been proposed to reduce the domain gap between the source and the target domain. Despite the efficacy of UDA techniques, most of these works rely on the strong assumption that the target data is composed of a single homogeneous domain. This assumption is often violated in real-world scenarios. In autonomous driving, for example, the target data will likely be composed of various subdomains such as night, snow, and rain. Therefore, directly applying current UDA approaches to such target data may deliver limited performance. This paper focuses on the challenging problem of open compound domain adaptation (OCDA) in semantic segmentation, where the target domain is unlabeled and contains multiple homogeneous subdomains. The goal of OCDA is to adapt a model to a compound target domain and to further enhance the model's generalization to unseen domains.
To perform OCDA, Liu et al. [13] propose an easy-to-hard curriculum learning strategy, where samples closer to the source domain are chosen first for adaptation. However, it does not fully exploit the subdomain boundary information in the compound target domain. To explicitly consider this information, current OCDA works [8, 19] separate the compound target domain into multiple subdomains based on image style information. Existing works use a manual domain separation method, and they employ a single model to simultaneously adapt to all the target subdomains. However, adapting to a target subdomain might hinder the model from adapting to other dissimilar target subdomains, which leads to limited performance. We propose a multi-teacher framework with bidirectional photometric mixing for open compound domain adaptation in semantic segmentation to tackle this issue. First, we propose automatic domain separation to find the optimal number of subdomains and split the target compound domain. Then, we present a multi-teacher framework in which each teacher model uses bidirectional photometric mixing to adapt to one target subdomain. On this basis, we conduct adaptive distillation to learn a student model and apply a fast and short online update using consistency regularization to improve the student's generalization to the open domains. We evaluate our approach on benchmark datasets: the proposed approach outperforms all the existing state-of-the-art OCDA techniques and the latest UDA techniques on both the domain adaptation and domain generalization tasks.
The Contributions of This Work. (1) We propose automatic domain separation to find the optimal number of target subdomains; (2) we present a multi-teacher framework with bidirectional photometric mixing to reduce the domain gaps between the source domain and every target subdomain separately; (3) we further conduct adaptive distillation to learn a student model and apply consistency regularization to improve the student's generalization to the open domains.

2 Related Work

Unsupervised Domain Adaptation. Unsupervised domain adaptation (UDA) techniques are used to reduce the expensive cost of pixel-wise labeling in tasks like semantic segmentation. In UDA, adversarial learning is actively used to align the input-level style via image translation, the feature distribution, or the structured output [27, 10, 28, 17, 29]. Alternatively, self-training approaches [2, 33, 25, 37] have also recently demonstrated compelling performance in this context. Despite the improvement provided by UDA techniques, their applicability to real scenarios remains restricted by the implicit single-source, single-target assumption, i.e., that the target data contains images from a single distribution.

Domain Generalization. The purpose of domain generalization (DG) is to train a model, solely using source domain data, such that it can make reliable predictions on unseen domains. While DG is an essential problem, only a few works have addressed it for semantic segmentation. DG for semantic segmentation shows two main streams: augmentation-based and network-based approaches. The augmentation-based approaches [30, 11] significantly augment the training data via an additional style dataset to learn domain-invariant representations. The network-based approaches [18, 4] modify the structure of the network to suppress domain-specific information (such as colors or styles) so that the resulting model mainly focuses on content-specific information. Even though DG for semantic segmentation has achieved notable progress, its performance is inevitably lower than that of several UDA methods due to the absence of target images, which are capable of providing abundant domain-specific information.

Open Compound Domain Adaptation. Liu et al. [13] first introduced Open Compound Domain Adaptation (OCDA), which handles an unlabeled, heterogeneous compound target domain and unseen open domains. While Liu et al. [13] propose a curriculum learning strategy, it fails to consider the specific information of each target subdomain. Current OCDA works [8, 19] propose to separate the compound target domain into multiple subdomains to handle the intra-domain gaps. Gong et al. [8] adopt domain-specific batch normalization for adaptation. Park et al. [19] utilize GAN-based image translation and adversarial training to exploit domain-invariant features from multiple subdomains.

3 Generating Optimal Subdomains

Figure 1: The part of generating optimal subdomains consists of automatic domain separation (ADS) and subdomain style purification (SSP). In ADS, we adopt the Silhouette Coefficient [23] to find the optimal number of subdomains $K^*$. In SSP, we compute the mean LAB histograms of each target subdomain $\mathcal{T}_k$ according to Equation 4, and the purified subdomain is denoted as $\bar{\mathcal{T}}_k$.

3.1 Automatic Domain Separation

Our work assumes that the domain-specific property of an image comes from its style. Existing works adopt a predefined parameter to decide the number of subdomains, which might lead to non-optimal domain adaptation performance; furthermore, they rely on a pre-trained CNN-based encoder to extract the style information for subdomain discovery. In contrast, we propose an automatic domain separation (ADS) to effectively separate the target domain using the distribution of pixel values of the target images. The proposed ADS predicts the optimal number of subdomains without relying on any predefined parameter and extracts the image style information without relying on any pre-trained CNN model. We denote the source domain as $\mathcal{S}$ and the unlabeled compound target domain as $\mathcal{T}$. We also assume that the compound target domain contains $K$ latent subdomains $\{\mathcal{T}_1, \dots, \mathcal{T}_K\}$, for which no clear prior knowledge is available to distinguish them. The goal of ADS is to find the optimal number of subdomains $K^*$ and separate $\mathcal{T}$ into subdomains accordingly.

Current work [14] suggests a simple yet effective style translation method by matching the distribution of pixel values in LAB color space. Thus, we adopt the LAB space in ADS to extract the style information of the target images. Given a target RGB image $x_t$ as input, we convert it into the LAB color space, whose three channels are denoted $L$, $A$, and $B$. Then, we compute the histograms of the pixel values for the three channels, $h^L(x_t)$, $h^A(x_t)$, and $h^B(x_t)$, and concatenate them into a single style descriptor $h(x_t)$, which we take as the input to ADS for domain separation. However, most existing clustering algorithms require a hyperparameter to determine the number of clusters, and directly applying a naive clustering might lead to non-optimal adaptation performance. Thus, we propose to find the optimal number of subdomains using the Silhouette Coefficient (SC) [23]. Suppose the target domain is separated into $K$ subdomains $\{\mathcal{T}_1, \dots, \mathcal{T}_K\}$. For each target image $x_i$, we denote by $a(x_i)$ the average distance between $x_i$ and all other target images in the target subdomain to which $x_i$ belongs, and by $b(x_i)$ the minimum average distance from $x_i$ to the target subdomains to which $x_i$ does not belong. Assuming $x_i$ belongs to the target subdomain $\mathcal{T}_k$, $a(x_i)$ and $b(x_i)$ are written as

$$a(x_i) = \frac{1}{|\mathcal{T}_k| - 1} \sum_{x_j \in \mathcal{T}_k,\, j \neq i} d\big(h(x_i), h(x_j)\big), \qquad b(x_i) = \min_{l \neq k} \frac{1}{|\mathcal{T}_l|} \sum_{x_j \in \mathcal{T}_l} d\big(h(x_i), h(x_j)\big), \tag{1}$$

where $d(\cdot,\cdot)$ is the Euclidean distance between the style descriptors and $|\mathcal{T}_k|$ is the number of target images in $\mathcal{T}_k$. The SC score for a partition into $K$ target subdomains is given by

$$SC(K) = \frac{1}{|\mathcal{T}|} \sum_{x_i \in \mathcal{T}} \frac{b(x_i) - a(x_i)}{\max\big(a(x_i), b(x_i)\big)}. \tag{2}$$

Hence, the goal of the proposed ADS is to find

$$K^* = \arg\max_K \, SC(K). \tag{3}$$
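To make the procedure concrete, the following is a minimal sketch of the ADS idea under stated assumptions: k-means is used as the underlying clustering algorithm (the paper does not name the clustering method), the style descriptor is a concatenation of 256-bin per-channel LAB histograms, and the candidate range of $K$ is an illustrative choice. It is not the authors' implementation.

```python
# Minimal sketch of automatic domain separation (ADS).
# Assumptions: k-means clustering, uint8 RGB inputs, 256-bin LAB histograms.
import numpy as np
import cv2
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def lab_style_descriptor(rgb_image, bins=256):
    """Concatenated L/A/B pixel-value histograms used as the style descriptor h(x)."""
    lab = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2LAB)
    hists = [np.histogram(lab[..., c], bins=bins, range=(0, 255), density=True)[0]
             for c in range(3)]
    return np.concatenate(hists)

def automatic_domain_separation(target_images, k_min=2, k_max=8):
    """Pick the number of subdomains K* that maximizes the Silhouette Coefficient."""
    styles = np.stack([lab_style_descriptor(img) for img in target_images])
    best_k, best_sc, best_labels = None, -1.0, None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(styles)
        sc = silhouette_score(styles, labels, metric="euclidean")
        if sc > best_sc:
            best_k, best_sc, best_labels = k, sc, labels
    return best_k, best_labels
```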

3.2 Subdomain Style Purification

With the help of automatic domain separation, the number of abnormal samples with different styles inside each target subdomain is small. Though these abnormal samples might be useful for the model's generalization, they could also lead to negative transfer, which hinders the model from learning domain-invariant features in a specific subdomain. To cope with this, we propose to purify the style distribution of the target images inside each subdomain. We design a subdomain style purification (SSP) module to effectively make the styles of the images within the same subdomain similar. Given the target subdomain $\mathcal{T}_k$, we adopt the LAB histograms (mentioned in 3.1) and compute the mean histograms over all three channels, denoted $\bar{h}^L_k$, $\bar{h}^A_k$, and $\bar{h}^B_k$:

$$\bar{h}^c_k = \frac{1}{|\mathcal{T}_k|} \sum_{x_t \in \mathcal{T}_k} h^c(x_t), \quad c \in \{L, A, B\}, \tag{4}$$

where $|\mathcal{T}_k|$ is the number of target images in $\mathcal{T}_k$. We take $(\bar{h}^L_k, \bar{h}^A_k, \bar{h}^B_k)$ as the standard style of $\mathcal{T}_k$. For each target RGB image $x_t \in \mathcal{T}_k$, we change the style of $x_t$ to generate a new RGB image by histogram matching [20] against $\bar{h}^L_k$, $\bar{h}^A_k$, and $\bar{h}^B_k$ in the LAB color space. The SSP process is performed for all the subdomains, and we denote the purified subdomains after SSP as $\{\bar{\mathcal{T}}_1, \dots, \bar{\mathcal{T}}_{K^*}\}$.
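A minimal sketch of the SSP idea follows, assuming histogram matching to the mean LAB histograms is implemented via per-channel CDF matching; the helper names and bin count are illustrative and not taken from the paper.

```python
# Minimal sketch of subdomain style purification (SSP).
# Assumptions: uint8 RGB input, 256-bin histograms, CDF-based matching.
import numpy as np
import cv2

def match_channel_to_hist(channel, ref_hist, bins=256):
    """Map one LAB channel so its histogram approximately matches a reference histogram."""
    src_hist, _ = np.histogram(channel, bins=bins, range=(0, 255), density=True)
    src_cdf = np.cumsum(src_hist); src_cdf /= src_cdf[-1]
    ref_cdf = np.cumsum(ref_hist); ref_cdf /= ref_cdf[-1]
    bin_centers = np.arange(bins)
    # For each source intensity, find the reference intensity with the same CDF value.
    mapping = np.interp(src_cdf, ref_cdf, bin_centers)
    return mapping[channel.astype(np.int32)].astype(np.uint8)

def purify_style(rgb_image, mean_hists):
    """Transform an image toward the subdomain's standard style (mean LAB histograms)."""
    lab = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2LAB)
    matched = np.stack([match_channel_to_hist(lab[..., c], mean_hists[c])
                        for c in range(3)], axis=-1)
    return cv2.cvtColor(matched, cv2.COLOR_LAB2RGB)
```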

4 Multi-teacher Framework

Figure 2: (a) The architecture of the proposed bidirectional photometric mixing. (b) The diagram of the multi-teacher learning framework.

4.1 Bidirectional Photometric Mixing

Through automatic domain separation and subdomain style purification (mentioned in 3.1 and 3.2), the compound domain is automatically separated into multiple purified subdomains $\{\bar{\mathcal{T}}_1, \dots, \bar{\mathcal{T}}_{K^*}\}$, where $K^*$ is the optimal number of subdomains. Our next step is to minimize the domain gap between the source domain and each target subdomain. A recent UDA work, DACS [25], presents a mixing-based UDA technique for semantic segmentation. Inspired by DACS, we propose bidirectional photometric mixing (BPM) to minimize the domain gap between the source domain and each target subdomain separately. Compared with DACS, the proposed BPM adopts a photometric transform that decreases the style inconsistency of the mixed images to reduce the pixel-level domain gap. On this basis, BPM applies a bidirectional mixing scheme to provide a more robust regularization for training. The architecture of BPM is shown in Figure 2(a). The proposed BPM contains a domain adaptive segmentation network $f_\theta$ and a momentum network $g_\phi$ that improves the stability of the pseudo labels. Let $x_s \in \mathbb{R}^{H \times W \times 3}$ denote a source RGB image and $y_s$ its pixel-wise annotation map, and let $x_t \in \mathbb{R}^{H \times W \times 3}$ denote a purified target RGB image from the purified subdomain $\bar{\mathcal{T}}_k$, where $H$ and $W$ are the image height and width. Our BPM applies the mixing in two directions: $\mathcal{S} \to \bar{\mathcal{T}}_k$ and $\bar{\mathcal{T}}_k \to \mathcal{S}$.

In the direction of mixing from $\mathcal{S}$ to $\bar{\mathcal{T}}_k$, we choose ClassMix [15] because the source image has a pixel-wise annotation map $y_s$. We first randomly select a subset of the classes present in $y_s$. Then, we define a binary mask $M$ in which a pixel position is set to 1 when the corresponding pixel of $x_s$ belongs to the selected classes, and to 0 otherwise. While ClassMix suggests directly copying the pixels of the selected source classes onto $x_t$, the mixed image generated by ClassMix contains an inconsistent style distribution, which might hinder the adaptation performance. To cope with this limitation, the proposed BPM applies a photometric transform that maps the selected source pixels to the style of the target image before copying them onto it. Let $M \odot x_s$ represent the selected source pixels, where $\odot$ denotes element-wise multiplication. We first calculate the histograms of the selected source pixels in the LAB color space and match them with those of $x_t$; the translated source pixels are represented as $\mathcal{P}_{s \to t}(x_s)$. Then, we copy the translated source pixels onto $x_t$. We present some qualitative results in Figure 4. Note that no ground-truth annotation is available for $x_t$. Thus, we send the purified target image $x_t$ to the momentum network $g_\phi$ to generate a stable prediction map $\hat{y}_t$ as the pseudo label. The mixing in the direction $\mathcal{S} \to \bar{\mathcal{T}}_k$ by BPM is given by

$$x_{s \to t} = M \odot \mathcal{P}_{s \to t}(x_s) + (1 - M) \odot x_t, \qquad y_{s \to t} = M \odot y_s + (1 - M) \odot \hat{y}_t, \tag{5}$$

where $x_{s \to t}$ is the generated mixed image, $y_{s \to t}$ is the corresponding mixed pseudo label, and $\mathcal{P}_{s \to t}(\cdot)$ is the photometric transform of the selected source pixels by histogram matching in the LAB color space.
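Below is a minimal sketch of this source-to-target step. It assumes a hypothetical helper `match_lab_histogram(src, ref)` that restyles pixels toward the LAB histograms of a reference image (for instance built like the SSP sketch above); the fraction of classes selected per image is an assumption, and the code is illustrative rather than the authors' implementation.

```python
# Minimal sketch of the source-to-target photometric ClassMix step.
# Assumptions: y_s and y_t_pseudo are HxW integer label maps, images are HxWx3,
# and `match_lab_histogram` performs the photometric transform P_{s->t}.
import numpy as np

def photometric_classmix(x_s, y_s, x_t, y_t_pseudo, match_lab_histogram, rng=np.random):
    """Copy a random subset of the source classes onto the target image after
    matching their style to the target, and mix the label maps accordingly."""
    classes = np.unique(y_s)
    chosen = rng.choice(classes, size=max(1, len(classes) // 2), replace=False)
    mask = np.isin(y_s, chosen)                      # H x W binary ClassMix mask M
    x_s_styled = match_lab_histogram(x_s, x_t)       # photometric transform of source pixels
    x_mix = np.where(mask[..., None], x_s_styled, x_t)
    y_mix = np.where(mask, y_s, y_t_pseudo)
    return x_mix, y_mix, mask
```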

In the direction of mixing from $\bar{\mathcal{T}}_k$ to $\mathcal{S}$, however, ClassMix cannot be applied since no ground-truth annotation is available for $x_t$. Inspired by CutMix [31], we generate another binary mask $M'$ by sampling a rectangular bounding box whose size is drawn from a uniform distribution, bounded by the image height $H$ and width $W$. The binary mask $M'$ is formed by filling the pixel positions inside the bounding box with 1 and the remaining positions with 0. With the help of $M'$, we select the target pixels and transform them into the source style; the transformed target pixels are represented by $\mathcal{P}_{t \to s}(x_t)$. Then, we paste them onto the source image $x_s$. The mixing in the direction $\bar{\mathcal{T}}_k \to \mathcal{S}$ is given by

$$x_{t \to s} = M' \odot \mathcal{P}_{t \to s}(x_t) + (1 - M') \odot x_s, \qquad y_{t \to s} = M' \odot \hat{y}_t + (1 - M') \odot y_s, \tag{6}$$

where $x_{t \to s}$ is the other generated mixed image, $y_{t \to s}$ is the corresponding mixed pseudo label, and $\mathcal{P}_{t \to s}(\cdot)$ is the photometric transform of the selected target pixels by histogram matching in the LAB color space.
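A minimal sketch of the target-to-source direction under similar assumptions follows; the sampling law and range of the box size are illustrative placeholders, not the paper's exact scheme, and `match_lab_histogram` is the same hypothetical helper as above.

```python
# Minimal sketch of the target-to-source direction: a CutMix-style rectangular
# mask M' selects target pixels, which are style-matched to the source and pasted.
import numpy as np

def photometric_cutmix(x_t, y_t_pseudo, x_s, y_s, match_lab_histogram, rng=np.random):
    h, w = x_s.shape[:2]
    lam = rng.uniform(0.3, 0.7)                       # assumed range for the box area ratio
    bh, bw = int(h * np.sqrt(lam)), int(w * np.sqrt(lam))
    cy, cx = rng.randint(0, h), rng.randint(0, w)
    y1, y2 = np.clip(cy - bh // 2, 0, h), np.clip(cy + bh // 2, 0, h)
    x1, x2 = np.clip(cx - bw // 2, 0, w), np.clip(cx + bw // 2, 0, w)
    mask = np.zeros((h, w), dtype=bool)
    mask[y1:y2, x1:x2] = True                         # 1 inside the box, 0 outside
    x_t_styled = match_lab_histogram(x_t, x_s)        # photometric transform of target pixels
    x_mix = np.where(mask[..., None], x_t_styled, x_s)
    y_mix = np.where(mask, y_t_pseudo, y_s)
    return x_mix, y_mix, mask
```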

We use $x_{s \to t}$ and $x_{t \to s}$ to train the segmentation network $f_\theta$ and the momentum network $g_\phi$. We first optimize the parameters of $f_\theta$ through

$$\mathcal{L}_{seg} = \mathcal{L}_{ce}\big(f_\theta(x_s), y_s\big) + \lambda_1 \mathcal{L}_{ce}\big(f_\theta(x_{s \to t}), y_{s \to t}\big) + \lambda_2 \mathcal{L}_{ce}\big(f_\theta(x_{t \to s}), y_{t \to s}\big), \tag{7}$$

where $\theta$ represents the parameters of $f_\theta$, $\mathcal{L}_{ce}$ is the cross-entropy loss between the predicted segmentation maps and the ground-truth or pseudo labels, and $\lambda_1$ and $\lambda_2$ are hyper-parameters controlling the effect of the mixing in the two directions. To help the momentum network $g_\phi$ provide stable pseudo labels, we update its parameters $\phi$ using an exponential moving average (EMA) with momentum $m$. After finishing training iteration $t$, $\phi$ is updated by

$$\phi_{t+1} = m\,\phi_t + (1 - m)\,\theta_t. \tag{8}$$
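For reference, a minimal sketch of the EMA update of Equation 8, assuming PyTorch networks with identical architectures; the momentum value shown is an illustrative placeholder, not the paper's setting.

```python
# Minimal sketch of the EMA update of the momentum network (Eq. 8).
import torch

@torch.no_grad()
def ema_update(momentum_net, seg_net, m=0.999):  # m is an assumed momentum value
    # phi <- m * phi + (1 - m) * theta, applied parameter-wise.
    for p_ema, p in zip(momentum_net.parameters(), seg_net.parameters()):
        p_ema.mul_(m).add_(p, alpha=1.0 - m)
```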

4.2 Multi-teacher Adaptive Knowledge Distillation

We propose a multi-teacher framework followed by adaptive knowledge distillation to align the domain gaps between the source domain and all the target subdomains. Given a purified subdomain $\bar{\mathcal{T}}_k$, we adopt a BPM as a specific teacher model $f_{\theta_k}$ to minimize the domain gap between $\mathcal{S}$ and $\bar{\mathcal{T}}_k$. We train the proposed multi-teacher framework by minimizing the loss function over all the teacher models, i.e.,

$$\mathcal{L}_{MT} = \sum_{k=1}^{K^*} \mathcal{L}_{seg}^{(k)}, \tag{9}$$

where $\mathcal{L}_{seg}^{(k)}$ (defined in Equation 7) is the loss of the segmentation network in the $k$-th teacher model and $K^*$ is the optimal number of subdomains. Moreover, we learn a segmentation network $f_{\theta_{stu}}$ as the student via adaptive knowledge distillation from all the teacher networks $\{f_{\theta_k}\}_{k=1}^{K^*}$. Given a random target image $x_t$ from $\mathcal{T}$, we send $x_t$ to all the teacher models, and the student learns from a weighted average $\bar{p}(x_t)$ of all the teachers' predictions $p_k(x_t)$, based on each teacher's confidence score. We adopt the entropy of the $k$-th teacher's prediction map, computed over the $C$ classes we consider, as the confidence measure of that teacher. The weight $w_k$ for teacher $k$ is obtained by normalizing these confidences over all teachers, and the average prediction is formulated as

$$\bar{p}(x_t) = \sum_{k=1}^{K^*} w_k \, p_k(x_t). \tag{10}$$

On this basis, we optimize the student segmentation network with a distillation loss defined by

$$\mathcal{L}_{KD} = \mathcal{L}_{KL}\big(f_{\theta_{stu}}(x_t) \,\|\, \bar{p}(x_t)\big), \tag{11}$$

where $\mathcal{L}_{KL}$ is the KL divergence between the output of $f_{\theta_{stu}}$ and $\bar{p}(x_t)$. The goal of the multi-teacher adaptive knowledge distillation is to obtain the optimal parameters of the student segmentation network by

$$\theta_{stu}^* = \arg\min_{\theta_{stu}} \; \mathbb{E}_{x_t \sim \mathcal{T}}\big[\mathcal{L}_{KD}\big]. \tag{12}$$
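A minimal sketch of the adaptive distillation step follows, assuming PyTorch segmentation networks that output logits and a softmax over negative prediction entropies as the teacher weighting; the exact normalization used in the paper may differ, so this is illustrative only.

```python
# Minimal sketch of entropy-weighted multi-teacher distillation (Eqs. 10-11).
# Assumptions: logits of shape (B, C, H, W); softmax(-entropy) weighting.
import torch
import torch.nn.functional as F

def adaptive_distillation_loss(student, teachers, x_t):
    with torch.no_grad():
        probs = [F.softmax(t(x_t), dim=1) for t in teachers]          # teacher predictions
        ents = torch.stack([(-p * torch.log(p + 1e-8)).sum(dim=1).mean() for p in probs])
        weights = F.softmax(-ents, dim=0)                             # confident teachers weigh more
        avg_pred = sum(w * p for w, p in zip(weights, probs))         # weighted average prediction
    log_p_student = F.log_softmax(student(x_t), dim=1)
    # KL divergence between the student prediction and the weighted teacher average.
    return F.kl_div(log_p_student, avg_pred, reduction="batchmean")
```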

Online Updating with Consistency Regularization. To evaluate the generalization of our approach, we directly evaluate our student network on the open domains as shown in Table 2(a) and Table 2(b). Additionally, after finishing the compound domain adaptation training, we also provide a fast and short online update for the student network using consistency regularization, which further boosts the generalization of the student network. Given an RGB image $x_o$ from an open domain, we first match the style of $x_o$ to each of the standard styles of the existing target subdomains, where the standard styles are defined as the mean histograms (defined in 3.2). The newly transformed images are $\{x_{o \to 1}, \dots, x_{o \to K^*}\}$, where $x_{o \to k}$ is generated by matching $x_o$ to the style of subdomain $\bar{\mathcal{T}}_k$. We then conduct an online update of the student network by

$$\mathcal{L}_{con} = \frac{1}{K^*} \sum_{k=1}^{K^*} \mathcal{L}_{mae}\big(f_{\theta_{stu}}(x_{o \to k}),\; f_{\theta_{stu}}(x_o)\big), \tag{13}$$

where $\mathcal{L}_{mae}$ is the mean absolute loss. After the online update, we test the student network with the newly learnt parameters again on the open domains.
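A minimal sketch of the online consistency update follows. It assumes the consistency target is the prediction on the untransformed open-domain image and uses a hypothetical helper `match_to_hists` that restyles the image toward a subdomain's mean LAB histograms; neither assumption is stated verbatim in the paper.

```python
# Minimal sketch of the online update with consistency regularization (Eq. 13).
import torch
import torch.nn.functional as F

def online_update(student, optimizer, x_open, standard_styles, match_to_hists):
    student.train()
    with torch.no_grad():
        ref = F.softmax(student(x_open), dim=1)                 # prediction on the original image
    loss = 0.0
    for hists in standard_styles:                               # one standard style per subdomain
        x_k = match_to_hists(x_open, hists)                     # restyle toward subdomain k
        loss = loss + torch.abs(F.softmax(student(x_k), dim=1) - ref).mean()  # mean absolute loss
    loss = loss / len(standard_styles)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```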

5 Experiments

5.1 Experimental Setup

5.1.1 Dataset.

In this work, we adopt the synthetic datasets GTA5 [21] and SYNTHIA [22] as the source domains. GTA5 contains 24,966 annotated images of resolution 1914x1052, and SYNTHIA consists of 9,400 images of resolution 1280x760. Furthermore, we adopt C-Driving [13] as the compound target domain, which contains real images collected under different weather conditions. Following the settings of previous works [13, 19, 8], we use the rainy, snowy, and cloudy images as the compound target domain and adopt the overcast images as the open domain. We also use ACDC [24] as another compound target domain; the corresponding evaluation results are shown in the supplementary material. We further adopt Cityscapes [5], KITTI [1], and WildDash [32] as open domains to evaluate the generalization ability of the proposed approach.

5.1.2 Implementation Details.

We adopt DeepLab-V2 [3] with a ResNet-101 backbone [9] pre-trained on ImageNet [6]. All the images from the target domain are rescaled and then randomly cropped for training. We adopt stochastic gradient descent with momentum and weight decay to optimize all the segmentation networks; the learning rate starts from an initial value and is decreased by polynomial decay. The momentum network has the same network architecture as the segmentation network. Existing mixing techniques include CutMix [31], CowMix [7], and ClassMix [15]. We adopt ClassMix in the mixing direction from the source domain to the target domain and CutMix in the mixing direction from the target domain to the source domain. Both $\lambda_1$ and $\lambda_2$ are set to the same value in the experiments. To increase the robustness of the segmentation model, we apply data augmentations, including flipping, color jittering, and Gaussian blurring, to the mixed images.
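As an illustration of the optimizer setup described above, here is a minimal sketch with placeholder values; the learning rate, momentum, weight decay, and decay exponent shown are assumptions for demonstration, not the paper's numbers.

```python
# Minimal sketch of the SGD optimizer and polynomial learning-rate decay.
import torch

def build_optimizer(model, base_lr=2.5e-4, momentum=0.9, weight_decay=5e-4):
    # Placeholder hyper-parameters; the paper's exact values are not reproduced here.
    return torch.optim.SGD(model.parameters(), lr=base_lr,
                           momentum=momentum, weight_decay=weight_decay)

def poly_lr(base_lr, cur_iter, max_iters, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - iter / max_iters) ** power."""
    return base_lr * (1.0 - cur_iter / max_iters) ** power
```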

(a) GTA5 → C-Driving
Method Type road sidewalk building wall fence pole light sign veg terrain sky person rider car truck bus train mbike bike mIoU
Source - 73.4 12.5 62.8 6.0 15.8 19.4 10.9 21.1 54.6 13.9 76.7 34.5 12.4 68.1 31.0 12.8 0.0 10.1 1.9 28.3
CDAS [13] OCDA 79.1 9.4 67.2 12.3 15.0 20.1 14.8 23.8 65.0 22.9 82.6 40.4 7.2 73.0 27.1 18.3 0.0 16.1 1.5 31.4
CSFU [8] OCDA 80.1 12.2 70.8 9.4 24.5 22.8 19.1 30.3 68.5 28.9 82.7 47.0 16.4 79.9 36.6 18.8 0.0 13.5 1.4 34.9
SAC [2] UDA 81.5 23.8 72.0 10.3 27.8 23.0 18.2 34.1 70.3 27.9 87.8 45.0 16.9 77.6 38.5 19.8 0.0 14.0 2.7 36.4
DACS [25] UDA 81.9 24.0 72.2 11.9 28.6 24.2 18.3 35.4 71.8 28.0 87.7 44.9 15.6 78.4 39.1 24.9 0.1 6.9 1.9 36.6
DHA[19] OCDA 79.9 14.5 71.4 13.1 32.0 27.1 20.7 35.3 70.5 27.5 86.4 47.3 23.3 77.6 44.0 18.0 0.1 13.7 2.5 37.1
Ours OCDA 85.3 26.2 72.8 10.6 33.1 26.9 24.6 39.4 70.8 32.5 87.9 47.6 29.2 84.8 46.0 22.8 0.2 16.7 5.8 40.2
(b) SYNTHIA → C-Driving
Method Type road sidewalk building wall fence pole light sign veg sky person rider car bus mbike bike mIoU mIoU*
Source - 33.9 11.9 42.5 1.5 0.0 14.7 0.0 1.3 56.8 76.5 13.3 7.4 57.8 12.5 2.1 1.6 20.9 28.1
CDAS [13] OCDA 54.5 13.0 53.9 0.8 0.0 18.2 13.0 13.2 60.0 78.9 17.6 3.1 64.2 12.2 2.1 1.5 25.3 34.0
CSFU [8] OCDA 69.6 12.2 50.9 1.3 0.0 16.7 12.1 13.6 56.2 75.8 20.0 4.8 68.2 14.1 0.9 1.2 26.1 34.8
SAC [2] UDA 69.8 13.4 56.2 1.7 0.0 20.0 9.6 13.7 52.5 78.1 29.1 15.5 68.9 10.9 3.2 1.2 27.7 36.3
DACS [25] UDA 62.1 15.2 48.8 0.3 0.0 19.7 10.3 9.6 57.8 84.4 35.2 18.9 67.8 16.0 2.2 1.7 28.1 36.5
DHA [19] OCDA 67.5 2.5 54.6 0.2 0.0 25.8 13.4 27.1 58.0 83.9 36.0 6.1 71.6 28.9 2.2 1.8 29.9 37.6
Ours OCDA 73.4 15.2 57.1 1.8 0.0 23.2 13.5 23.9 59.9 83.3 40.3 22.3 72.2 23.3 2.3 2.2 32.1 40.0
Table 1: The performance comparison of mean IoU on the compound domain. Our approach is compared with the state-of-the-art UDA and OCDA approaches on the (a) GTA5 → C-Driving and (b) SYNTHIA → C-Driving benchmark datasets with ResNet-101 as the backbone. Note that mIoU* represents the mean IoU over 11 of the classes.

5.2 Results

To demonstrate the efficacy of our approach, we conduct experiments on the benchmark datasets of GTA5 → C-Driving and SYNTHIA → C-Driving. We first compare our approach with the existing state-of-the-art OCDA approaches: CDAS [13], DHA [19], and CSFU [8]. Furthermore, we compare the proposed approach with the current state-of-the-art UDA approaches SAC [2] and DACS [25].

5.2.1 Compound Domain Adaptation.

We first compare the performance of our approach with existing state-of-the-art OCDA and UDA approaches on GTA5 → C-Driving, shown in Table 1a. All the results are generated on the validation set of C-Driving. Training only with the source data leads to 28.3% mean IoU over the 19 classes. As the first work on OCDA, CDAS achieves 31.4% mean IoU over all the classes. CSFU generates 34.9% mean IoU and DHA produces 37.1% mean IoU; both CSFU and DHA adopt a subdomain separation step and a GAN framework, and DHA additionally uses a more effective multi-discriminator to minimize the domain gaps. In comparison, the latest UDA approaches DACS and SAC reach 36.6% and 36.4%, outperforming both CDAS and CSFU. The reason is that both DACS and SAC adopt various self-supervision techniques to minimize the domain gaps, which prove more effective than the GAN-based approaches. The proposed approach demonstrates its effectiveness on this benchmark dataset with 40.2% mean IoU over all classes.

We present experimental results on SYNTHIA → C-Driving in Table 1b. We consider the 11 classes for the final evaluation. The proposed method achieves 40.0% mean IoU over the 11 classes. Among the other OCDA approaches, DHA achieves 37.6%, CSFU produces 34.8%, and CDAS generates 34.0% mean IoU. Moreover, the UDA approaches DACS and SAC generate 36.5% and 36.3% mean IoU. Our approach outperforms all the existing OCDA approaches and the latest UDA approaches.

GTA5
Method Type O C K W Avg
CSFU [8] OCDA 38.9 38.6 37.9 29.1 36.1
DACS [25] UDA 39.7 37.0 40.2 30.7 36.9
RobustNet [4] DG 38.1 38.3 40.5 30.8 37.0
DHA [19] OCDA 39.4 38.8 40.1 30.9 37.5
Ours (w/o Updating) OCDA 41.8 40.9 44.0 32.9 40.0
Ours (w/ Updating) OCDA 42.5 41.7 44.3 34.6 40.8
(a) GTA5 as the source domain.
SYNTHIA
Method Type O C K W Avg
CSFU [8] OCDA 36.2 34.9 32.4 27.6 32.8
DACS [25] UDA 36.8 37.0 37.4 28.8 35.0
RobustNet [4] DG 37.1 38.3 40.1 29.6 36.3
DHA [19] OCDA 38.9 38.0 40.6 30.0 36.9
Ours (w/o Updating) OCDA 41.5 40.3 42.7 30.1 38.7
Ours (w/ Updating) OCDA 42.6 41.1 43.4 30.9 39.5
(b) SYNTHIA as the source domain.
Table 2: The comparison of mean IoU on the open domains. The domain generalization (DG) model is trained only with the source domain. All the models are tested on the validation sets of C-Driving Open (O), Cityscapes (C), KITTI (K), and WildDash (W). We also present the scores of our approach without online updating (w/o Updating) and with online updating (w/ Updating).

5.2.2 Generalization to the Open Domains.

We also evaluate the domain generalization of the proposed approach against existing UDA and OCDA approaches. The results are presented in Table 2(a) and 2(b). Our work is compared with the latest domain generalization (DG) approach RobustNet [4]. For all the UDA and OCDA approaches, we first train them with the labeled source and the unlabeled target images, and we evaluate their performance on the validation sets of the open domains. RobustNet generates 37.0% mean IoU in Table 2(a) and 36.3% mean IoU in Table 2(b). Note that RobustNet only requires labeled source data during training. This shows that the DG approach is more effective in generalizing to the open domains than the existing UDA and OCDA approaches DACS and CSFU. Without any online updating, our approach achieves 40.0% mean IoU in Table 2(a) and 38.7% mean IoU in Table 2(b). Our approach outperforms all the UDA approaches, OCDA approaches, and the DG approach listed in the tables. The reason might be that our approach is more powerful in learning the domain-invariant features, which improves the generalization of the model toward novel domains. The performance gain of our approach with updating further shows the efficacy of the proposed online updating with consistency regularization.

5.3 Ablation Study

5.3.1 Generating Optimal Subdomains.

Figure 3: We conduct the ablation study on the proposed automatic domain separation using GTA5 → C-Driving with a ResNet-101 backbone. (a) The scatterplot shows the correlation between our approach's mean IoU and the Silhouette Coefficient score. (b) The mean IoU of our approach with different numbers of subdomains. (c) Sample images from the subdomains of the C-Driving dataset.

We first study the correlation between the mean IoU of the proposed approach and the Silhouette Coefficient (SC) score of the subdomain separation in Figure 3(a). It shows a positive correlation, which means that the SC score effectively identifies the optimal number of subdomains for the compound target domain. Moreover, we evaluate the mean IoU for different numbers of subdomains in Figure 3(b). Finally, we set the number of subdomains to the value selected by ADS and present sample images from the resulting subdomains of the C-Driving dataset in Figure 3(c). We also evaluate the efficacy of the proposed subdomain style purification (SSP) in Table 3(b): without SSP, the performance drops by 0.5% mean IoU.

GTA5 → C-Driving
Model mIoU
DACS [25] 36.6
DACS + Multi-teacher Learning 39.1
DACS + Bidirectional Mixing 37.3
DACS + Photometric Mixing 37.4
DACS + Bidirectional Photometric Mixing 37.8
Ours 40.2
(a) The performance gain.
GTA5 → C-Driving
Configuration mIoU Gap
w/o Multi-teacher Learning 38.0 -2.2
w/o Mixing on Source-to-Target Direction (ClassMix) 38.5 -1.7
w/o Mixing on Target-to-Source Direction (CutMix) 38.9 -1.3
w/o Subdomain Style Purification 39.7 -0.5
w/o Adaptive Distillation 39.6 -0.6
Full Framework 40.2 -
(b) The performance drop.
Table 3: The ablation study on the efficacy of the components of our model. (a) We compare with the baseline model DACS [25] and evaluate the performance gain of the bidirectional photometric mixing and the multi-teacher learning. (b) We evaluate the performance drop of our model when removing each component. Our model is trained on GTA5 → C-Driving with a ResNet-101 backbone and tested on the C-Driving validation set.

5.3.2 Multi-teacher and Single Model.

The ablation study on the multi-teacher learning of our proposed approach is presented in Table 3(a) and Table 3(b). Applying a single model in our approach delivers 38.0% mean IoU, the most significant drop (2.2%), as shown in Table 3(b). We further combine DACS with multi-teacher learning, and the mean IoU rises from 36.6% to 39.1%. We argue that utilizing a single model is less effective than the multi-teacher models because adapting to one subdomain might hinder the single model from adapting to other dissimilar subdomains. Thus, we employ a multi-teacher framework in which each teacher adapts to one subdomain separately, and the multiple teachers together provide comprehensive guidance to the student model for adapting to all the target subdomains. We further present qualitative results, namely the prediction maps of target images from each subdomain produced by the multi-teacher and single-teacher models, in Figure 5.

Figure 4: We compare the mixed images in the direction from the source domain to the target domain. (a) the source image; (b) the target image; (c) the mixed image without the photometric transform, where style inconsistency is visible; (d) the mixed image with the photometric transform, where the style inconsistency is mitigated; (e) the mask used to crop the source image.

5.3.3 Bidirectional Photometric Mixing.

We further conduct the ablation study on the bidirectional photometric mixing (BPM), shown in Table 3(a) and Table 3(b). Our model is trained on GTA5 → C-Driving with a ResNet-101 backbone and tested on the C-Driving validation set. Removing the mixing in the source-to-target direction (ClassMix) drops the mean IoU by 1.7%, while removing the mixing in the other direction (CutMix) decreases it by 1.3%. This suggests that ClassMix contributes slightly more to the final performance. We also use the baseline model DACS for an in-depth analysis. Adding the bidirectional photometric mixing to DACS increases the performance from 36.6% to 37.8%, as shown in Table 3(a); combining DACS with only bidirectional mixing raises the mean IoU to 37.3%; adding only the photometric transform to the mixing of DACS reaches 37.4%. The reason is that DACS utilizes a simple mixing method that works in only one direction and generates mixed images with internal style inconsistency, whereas we propose a bidirectional mixing scheme and apply the photometric transform to mitigate the style inconsistency of the generated images. We present qualitative results of this issue in Figure 4: the style inconsistency visible in Figure 4(c) is mitigated in Figure 4(d) for the mixing direction from the source domain to the target domain.

Figure 5: We present the predicted segmentation maps of the target images from every target subdomain. The maps in the second row are generated using a single model. The maps in the third row are generated using the multi-teacher models.

6 Conclusion

Open compound domain adaptation (OCDA) considers the target domain as a compound of multiple unknown subdomains. In this work, we first propose automatic domain separation to find the optimal number of subdomains. Then we design a multi-teacher framework with bidirectional photometric mixing to align the domain gap between the source domain and the compound target domain, and we further evaluate its generalization to novel domains. Our current work focuses only on the segmentation task, and we leave the study of other visual tasks for future research.

7 Acknowledgment

This work was supported by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) and the Ministry of Trade, Industry & Energy (MOTIE) of the Republic of Korea (No. 20224000000100).

References

  • [1] H. Abu Alhaija, S. K. Mustikovela, L. Mescheder, A. Geiger, and C. Rother (2018) Augmented reality meets computer vision: efficient data generation for urban driving scenes. IJCV 126 (9), pp. 961–972. Cited by: §5.1.1.
  • [2] N. Araslanov and S. Roth (2021) Self-supervised augmentation consistency for adapting semantic segmentation. In CVPR, pp. 15384–15394. Cited by: §2, §5.2, Table 1.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI 40 (4), pp. 834–848. Cited by: §5.1.2.
  • [4] S. Choi, S. Jung, H. Yun, J. T. Kim, S. Kim, and J. Choo (2021) RobustNet: improving domain generalization in urban-scene segmentation via instance selective whitening. In CVPR, pp. 11580–11590. Cited by: §2, §5.2.2, 2(a), 2(b).
  • [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: §5.1.1.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §5.1.2.
  • [7] G. French, A. Oliver, and T. Salimans (2020) Milking cowmask for semi-supervised image classification. arXiv preprint arXiv:2003.12022. Cited by: §5.1.2.
  • [8] R. Gong, Y. Chen, D. P. Paudel, Y. Li, A. Chhatkuli, W. Li, D. Dai, and L. Van Gool (2021) Cluster, split, fuse, and update: meta-learning for open compound domain adaptive semantic segmentation. In CVPR, pp. 8344–8354. Cited by: §1, §2, §5.1.1, §5.2, Table 1, 2(a), 2(b).
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §5.1.2.
  • [10] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) Cycada: cycle-consistent adversarial domain adaptation. In ICML, pp. 1989–1998. Cited by: §2.
  • [11] J. Huang, D. Guan, A. Xiao, and S. Lu (2021) Fsdr: frequency space domain randomization for domain generalization. In CVPR, pp. 6891–6902. Cited by: §2.
  • [12] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) Ccnet: criss-cross attention for semantic segmentation. In CVPR, pp. 603–612. Cited by: §1.
  • [13] Z. Liu, Z. Miao, X. Pan, X. Zhan, D. Lin, S. X. Yu, and B. Gong (2020) Open compound domain adaptation. In CVPR, pp. 12406–12415. Cited by: §1, §2, §5.1.1, §5.2, Table 1.
  • [14] H. Ma, X. Lin, Z. Wu, and Y. Yu (2021) Coarse-to-fine domain adaptive semantic segmentation with photometric alignment and category-center regularization. In CVPR, pp. 4051–4060. Cited by: §3.1.
  • [15] V. Olsson, W. Tranheden, J. Pinto, and L. Svensson (2021) Classmix: segmentation-based data augmentation for semi-supervised learning. In WACV, pp. 1369–1378. Cited by: §4.1, §5.1.2.
  • [16] C. Ouyang, C. Biffi, C. Chen, T. Kart, H. Qiu, and D. Rueckert (2020) Self-supervision with superpixels: training few-shot medical image segmentation without annotation. In ECCV, pp. 762–780. Cited by: §1.
  • [17] F. Pan, I. Shin, F. Rameau, S. Lee, and I. S. Kweon (2020) Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. In CVPR, pp. 3764–3773. Cited by: §1, §2.
  • [18] X. Pan, P. Luo, J. Shi, and X. Tang (2018) Two at once: enhancing learning and generalization capacities via ibn-net. In ECCV, pp. 464–479. Cited by: §2.
  • [19] K. Park, S. Woo, I. Shin, and I. S. Kweon (2020) Discover, hallucinate, and adapt: open compound domain adaptation for semantic segmentation. In NeurIPS, Cited by: §1, §2, §5.1.1, §5.2, Table 1, 2(a), 2(b).
  • [20] R. C. Gonzalez, R. E. Woods, and S. L. Eddins (2010) Digital image processing using MATLAB. Tata McGraw-Hill. Cited by: §3.2.
  • [21] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In ECCV, pp. 102–118. Cited by: §1, §5.1.1.
  • [22] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez (2016) The SYNTHIA Dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, Cited by: §1, §5.1.1.
  • [23] P. J. Rousseeuw (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. JCAM 20, pp. 53–65. Cited by: Figure 1, §3.1.
  • [24] C. Sakaridis, D. Dai, and L. Van Gool (2021) ACDC: the adverse conditions dataset with correspondences for semantic driving scene understanding. In ICCV, pp. 10765–10775. Cited by: §5.1.1.
  • [25] W. Tranheden, V. Olsson, J. Pinto, and L. Svensson (2021) DACS: domain adaptation via cross-domain mixed sampling. In WACV, pp. 1379–1389. Cited by: §1, §2, §4.1, §5.2, Table 1, 2(a), 2(b), Table 3, 3(a).
  • [26] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, Cited by: §1.
  • [27] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, pp. 7472–7481. Cited by: §2.
  • [28] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) ADVENT: adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR, Cited by: §1, §2.
  • [29] Z. Wang, M. Yu, Y. Wei, R. Feris, J. Xiong, W. Hwu, T. S. Huang, and H. Shi (2020) Differential treatment for stuff and things: a simple unsupervised domain adaptation method for semantic segmentation. In CVPR, pp. 12635–12644. Cited by: §2.
  • [30] X. Yue, Y. Zhang, S. Zhao, A. Sangiovanni-Vincentelli, K. Keutzer, and B. Gong (2019) Domain randomization and pyramid consistency: simulation-to-real generalization without accessing target domain data. In ICCV, pp. 2100–2110. Cited by: §2.
  • [31] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) Cutmix: regularization strategy to train strong classifiers with localizable features. In CVPR, pp. 6023–6032. Cited by: §4.1, §5.1.2.
  • [32] O. Zendel, K. Honauer, M. Murschitz, D. Steininger, and G. F. Dominguez (2018) Wilddash-creating hazard-aware benchmarks. In ECCV, pp. 402–416. Cited by: §5.1.1.
  • [33] P. Zhang, B. Zhang, T. Zhang, D. Chen, Y. Wang, and F. Wen (2021) Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. CVPR. Cited by: §1, §2.
  • [34] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In CVPR, pp. 6848–6856. Cited by: §1.
  • [35] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, pp. 2881–2890. Cited by: §1.
  • [36] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al. (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, pp. 6881–6890. Cited by: §1.
  • [37] Y. Zou, Z. Yu, B. Kumar, and J. Wang (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, pp. 289–305. Cited by: §1, §2.

8 Subdomain Style Purification and the t-SNE Visualization

Figure 6: (a) presents noisy samples from Subdomain 2 of the C-Driving dataset before subdomain style purification (before SSP) and after subdomain style purification (after SSP). (b) shows the t-SNE visualization of the concatenated LAB histograms of the C-Driving dataset for the chosen number of subdomains.

As mentioned in Section 3.2, it is hard to guarantee that the images from the same target subdomain have the same style. In other words, small domain gaps might still result from the varied image styles within each subdomain. We propose subdomain style purification to unify the styles of the target data belonging to the same subdomain so that the domain gaps among these images can be further reduced. We visualize sample images transformed by subdomain style purification (SSP) from Subdomain 2 in Figure 6(a). Note that the images before SSP in Figure 6(a) have styles different from the standard style, and they are transformed into the standard style with the help of histogram matching in the LAB color space. We further present the t-SNE visualization of the concatenated LAB histograms of the C-Driving images for the chosen number of subdomains.

The reason for subdomain style purification (SSP). With the help of automatic domain separation, the number of abnormal samples with different styles is small. Though these abnormal samples might be helpful for the model's generalization, they could also lead to negative transfer, which hinders the model from learning domain-invariant features in a specific subdomain. On GTA5 → C-Driving, we observe a 0.5% mIoU drop on average over all the subdomains when SSP is not used, as shown in Table 3(b).

9 ACDC Dataset

We also evaluate the proposed approach on the ACDC dataset [24]. ACDC contains real-world images of road scenes in diverse weather conditions, including fog, nighttime, rain, and snow. We take the fog, nighttime, and rain images from the training split of ACDC as the compound domain, and the snow images with pixel-wise annotations from the ACDC training split as the open domain. The final performance is evaluated on the validation set of ACDC, which provides images with ground-truth maps.

(a) GTA5 → ACDC
Method Type road sidewalk building wall fence pole light sign veg terrain sky person rider car truck bus train mbike bike mIoU (Compound) mIoU (Open)
Source - 43.6 2.5 46.2 5.2 0.1 30.3 15.3 16.3 56.9 0.0 71.5 16.3 13.7 51.4 0.0 15.1 0.0 1.4 4.2 20.5 27.1
CDAS[13] OCDA 53.2 5.9 56.1 10.1 2.6 22.0 37.1 11.4 53.9 23.5 71.3 27.6 14.6 47.5 16.8 19.5 0.0 3.2 3.8 25.3 29.1
CSFU[8] OCDA 47.0 4.1 53.0 13.9 1.0 23.2 41.2 18.8 55.8 23.2 72.1 31.5 10.8 69.1 26.4 27.8 0.2 1.7 2.6 27.6 30.5
SAC[2] UDA 42.6 4.2 57.6 11.9 3.8 23.0 49.7 23.8 63.6 31.9 76.0 30.3 10.5 65.3 23.6 23.1 0.1 0.7 3.2 28.7 33.6
DACS [25] UDA 48.9 9.7 54.5 16.8 5.7 22.7 42.0 22.9 61.3 29.7 73.7 32.2 11.6 63.3 23.2 26.5 0.0 1.2 5.2 29.0 34.8
DHA[19] OCDA 49.8 5.2 59.1 10.2 3.1 25.6 47.8 27.9 65.1 32.0 75.2 29.0 12.2 61.5 20.5 32.4 0.0 1.0 2.0 29.5 37.5
Ours OCDA 48.4 5.0 58.2 25.3 10.0 35.1 50.4 26.7 66.8 33.3 75.8 32.1 16.7 73.5 16.8 26.6 0.2 3.9 4.6 32.1 41.6
(b) SYNTHIA → ACDC
Method Type road sidewalk building wall fence pole light sign veg sky person rider car bus mbike bike mIoU (Compound) mIoU (Open)
Source - 45.2 0.2 36.7 1.7 0.6 25.7 4.0 5.6 46.6 64.3 16.9 11.3 39.6 16.5 0.6 1.9 19.8 20.5
CDAS[13] OCDA 61.3 0.7 60.1 11.7 1.8 28.4 18.8 23.5 48.6 28.9 16.5 15.9 69.2 18.4 5.4 5.6 25.9 23.3
CSFU[8] OCDA 62.6 0.3 60.3 8.6 1.8 21.3 20.7 29.1 44.5 22.1 34.5 19.0 71.1 23.2 4.4 4.3 26.7 24.8
SAC[2] UDA 69.8 0.4 56.2 1.7 0.0 20.0 12.6 13.7 52.5 78.1 29.1 15.5 68.9 20.9 3.2 1.2 27.7 25.4
DACS [25] UDA 55.6 1.1 55.7 0.1 0.7 25.8 31.7 18.3 65.5 53.7 31.1 16.6 69.2 22.5 2.9 3.1 28.3 27.0
DHA[19] OCDA 55.5 1.1 57.2 0.7 0.8 26.6 22.7 24.6 65.8 58.4 29.6 23.9 70.8 19.5 5.4 4.2 29.2 27.3
Ours OCDA 66.7 1.7 62.4 10.8 1.4 30.8 23.9 29.2 62.6 69.0 31.6 14.6 71.8 22.9 6.8 4.5 31.9 29.1
Table 4: The performance comparison of mean IoU on the compound target domain (fog, nighttime, and rain) and the open domain (snow) of ACDC. Our approach is compared with the state-of-the-art UDA and OCDA approaches on the (a) GTA5 → ACDC and (b) SYNTHIA → ACDC benchmark datasets with ResNet-101 as the backbone.

We present the performance comparison of mean IoU in Table 4. For the compound target domain of ACDC (fog, nighttime, rain), we achieve 32.1% mean IoU on GTA5 → ACDC and 31.9% mean IoU on SYNTHIA → ACDC, outperforming all the listed UDA and OCDA approaches. We also evaluate the generalization of our approach compared with other works: after finishing the compound domain adaptation training, all the models are directly tested on the open domain of ACDC (snow). Note that the snow images have never been used in training. On the GTA5 → ACDC and SYNTHIA → ACDC benchmarks, our approach achieves 41.6% and 29.1% mean IoU, respectively. This demonstrates that our approach has better generalization ability toward novel domains (snow).

(a) ImageNet pre-trained VGG-16 Backbone
Method Compound (C) Open (O) Average
Rainy Snowy Cloudy Overcast C C+O
CDAS[13] 23.8 25.3 29.1 31.0 26.1 27.3
CSFU[8] 24.5 27.5 30.1 31.4 27.7 29.4
DACS [25] 26.8 29.2 35.1 35.9 30.4 31.8
DHA[19] 27.1 30.4 35.5 36.1 32.0 32.3
Ours 34.5 35.8 39.9 40.1 36.7 37.5
(b) Mixing Algorithm Comparison
Algorithm BPM (Ours) ClassMix [15] CutMix [31] CowMix [7]
mIoU 40.2 39.1 37.6 37.4
Table 5: The evaluation on GTA5 → C-Driving.

10 The Practicability of Our Approach

Though we use multiple teacher models for training, our approach remains practical for two reasons: the teacher models are trained simultaneously, and only a single student model obtained by distillation is needed for inference. The size of the student model is not affected by the number of subdomains: while the FLOPS and parameter count of the multi-teacher ensemble grow with the number of subdomains, after the adaptive knowledge distillation the student model has the FLOPS and parameter count of a single segmentation network.

The VGG-16 backbone and different mixup algorithms. We also use a VGG-16 backbone network for evaluation. The experimental results on GTA5 → C-Driving in Table 5(a) demonstrate the effectiveness of our approach against existing works with an ImageNet pre-trained VGG-16 backbone. We further provide a comparison to existing domain mixup algorithms in the same setting, including ClassMix [15], CutMix [31], and CowMix [7].

The online updating on the open domains. Our online updating is conducted on each sample from the open domain; thus it is still domain generalization at the testing stage. Our student model is trained through adaptive distillation from all the subdomains' segmentation models (Eqs. 10 and 11). Each teacher segmentation network is optimized by Eq. 7 with the help of its mean teacher (momentum network), following DACS [25]. We also tried using the momentum networks instead of the segmentation networks for distillation, but did not observe a significant performance gain.

The reason for using bidirectional mixing. One alternative is to use pseudo-labels of the target data for ClassMix in the target-to-source direction. By instead using the photometric transform (Eq. 6) in target-to-source (t2s) mixing, we enforce consistency of the predictions between the target image and the mixed image, which acts as an additional augmentation that improves the model's performance (Table 3(a, b)). In the experiment on GTA5 → C-Driving, using pseudo-labels of the target data for ClassMix in target-to-source mixing yields an mIoU similar to ours (Table 1(a)). Table 5(b) shows that our BPM outperforms the existing mixing algorithms ClassMix [15], CutMix [31], and CowMix [7].