
ADAS: A Direct Adaptation Strategy for Multi-Target Domain Adaptive Semantic Segmentation

03/14/2022
by   Seunghun Lee, et al.
DGIST

In this paper, we present a direct adaptation strategy (ADAS), which aims to directly adapt a single model to multiple target domains in a semantic segmentation task without pretrained domain-specific models. To do so, we design a multi-target domain transfer network (MTDT-Net) that aligns visual attributes across domains by transferring the domain distinctive features through a new target adaptive denormalization (TAD) module. Moreover, we propose a bi-directional adaptive region selection (BARS) that reduces the attribute ambiguity among class labels by adaptively selecting the regions with consistent feature statistics. We show that our single MTDT-Net can synthesize visually pleasing domain transferred images on complex driving datasets, and that BARS effectively filters out unnecessary regions of the training images for each target domain. With the collaboration of MTDT-Net and BARS, our ADAS achieves state-of-the-art performance for multi-target domain adaptation (MTDA). To the best of our knowledge, our method is the first MTDA method that directly adapts to multiple domains in semantic segmentation.



1 Introduction

Unsupervised domain adaptation (UDA) [lee2018diverse, shen2019towards, hoffman2018cycada, jeong2021memory, lee2021dranet] aims to alleviate the performance drop caused by the distribution discrepancy between domains. It is widely utilized in synthetic-to-real adaptation for computer vision applications that require large amounts of labeled data. Most existing works are designed for single-target domain adaptation (STDA), which adapts a single network to one specific target domain. STDA rarely addresses the variability of the real world, particularly changes in driving region, illumination, and weather conditions in autonomous driving scenarios. This issue can be tackled by adopting multiple target-specific adaptation models, but doing so limits the memory efficiency, scalability, and practical utility of embedded autonomous systems.

Recently, multi-target domain adaptation (MTDA) methods [yu2018multi, peng2019domain, gholami2020unsupervised, nguyen2021unsupervised, isobe2021multi, saporta2021multi] have been proposed, which enable a single model to adapt a labeled source domain to multiple unlabeled target domains. Most of these works train multiple STDA models and then distill their knowledge into a single multi-target domain adaptation network. Recent approaches [nguyen2021unsupervised, isobe2021multi, saporta2021multi] transfer the knowledge from label predictors, as shown in Fig. 2-(a). These methods show impressive results, but their performance is bounded by that of the pretrained models. Moreover, inaccurate label predictions from the teacher network can degrade model performance, yet none of these works has investigated this issue in depth. To address this problem, we propose A Direct Adaptation Strategy (ADAS) that directly adapts a single model to multiple target domains without pretrained STDA models, as shown in Fig. 2-(b). Our approach achieves robust multi-domain adaptation by exploiting the feature statistics of the training data over multiple domains. The following introduces our two sub-modules in detail: the Multi-Target Domain Transfer Network (MTDT-Net) and the Bi-directional Adaptive Region Selection (BARS).

(a) Existing MTDA method (b) Ours
Figure 2: Illustration of the existing MTDA methods and ours. (a) Conventional MTDA methods pretrain an STDA model for each target and then distill the knowledge into a single MTDA model. (b) Our ADAS directly adapts to multiple target domains.

MTDT-Net We present a Multi-Target Domain Transfer Network (MTDT-Net) that transfers the distinctive attributes of the target domains to the source domain rather than learning all of the target domain distributions. Our network features a novel Target Adaptive Denormalization (TAD) module that adapts the statistics of the source features to those of the target features. While existing UDA works [liu2016coupled, bousmalis2017unsupervised, murez2018image, hoffman2018cycada, chen2019learning, chen2019crdoco, ma2021coarse] require domain-specific encoders and generators for multi-target domain adaptation, the TAD module enables our single network to adapt to multiple domains. Fig. 1 shows how a single MTDT-Net can efficiently synthesize visually pleasing domain transferred images.

Figure 3: Examples of regions with similar attributes but different labels (c) (purple: road, pink: sidewalk), and a noisy prediction (d). Panels: (a) domain transferred image, (b) target image, (c) ground-truth label of (a), (d) pseudo label of (b), (e) selected region of (c), (f) selected region of (d). The black regions in (e) and (f) are filtered out by BARS.

BARS Although the visual attributes across domains are well aligned, some attribute ambiguity among the class labels remains in semantic segmentation. The ambiguity is usually observed in regions with similar attributes but different labels, such as the sidewalks in GTA5 and the roads in Cityscapes shown in Fig. 3-(a),(c). This confuses the model when it searches for an accurate decision boundary. Moreover, predictions on target domains usually contain noisy labels that lead to inaccurate training of the task network, as shown in Fig. 3-(b),(d). To solve these issues, we propose a Bi-directional Adaptive Region Selection (BARS), which alleviates this confusion. It adaptively selects regions with consistent feature statistics, as shown in Fig. 3-(e). It also selects the pseudo label of the target images for our self-training scheme, as shown in Fig. 3-(f). We show that BARS enables robust training of the task network and improves performance.

To the best of our knowledge, our multi-target domain adaptation method is the first approach that directly adapts the task network to multiple target domains without pretrained STDA models in semantic segmentation. Extensive experiments show that the proposed method achieves state-of-the-art performance on the semantic segmentation task. Finally, we demonstrate the effectiveness of the proposed MTDT-Net and BARS.

2 Related Work

2.1 Domain Transfer

With the advent of generative adversarial networks (GANs) [goodfellow2014generative], adversarial learning has shown promising results not only in photorealistic image synthesis [radford2015unsupervised, arjovsky2017wasserstein, miyato2018spectral, mao2017least, karras2017progressive, karras2019style, brock2018large, karras2020analyzing] but also in domain transfer [isola2017image, zhu2017unpaired, liu2017unsupervised, taigman2016unsupervised, huang2018multimodal, lee2018diverse, liu2019few, shen2019towards, hoffman2018cycada, jeong2021memory, lee2021dranet]. Traditional domain transfer methods rely on adversarial learning [isola2017image, zhu2017unpaired, liu2017unsupervised, taigman2016unsupervised] or typical style transfer techniques [gatys2016image, ulyanov2016instance, dumoulin2016learned, nam2018batch, huang2017arbitrary]. Later, studies on feature disentanglement [huang2018multimodal, lee2018diverse, liu2019few, park2020swapping, lee2021dranet] presented models that can apply various styles by utilizing disentangled features that are separately encoded as content and style. Recent works [richter2021enhancing, zhu2020sean] have tackled more in-depth domain transfer problems. Richter et al. [richter2021enhancing] propose a rendering-aware denormalization (RAD) that constructs style tensors using the abundant condition information from G-buffers, and show high-fidelity domain transfer in a driving scene. Zhu et al. [zhu2020sean] propose a semantic region-wise domain transfer model that extracts a style vector for each semantic region.

2.2 Unsupervised Domain Adaptation for Semantic Segmentation

Traditional feature-level adaptation methods [hoffman2016fcns, luo2019significance, luo2019taking, tsai2018learning, vu2019advent] aim to align the source and target distributions in feature space. Most of them [hoffman2016fcns, luo2019significance, luo2019taking] adopt adversarial learning on the intermediate features of the segmentation network, while the others [tsai2018learning, vu2019advent] directly apply an adversarial loss to the output prediction. Pixel-level adaptation methods [liu2016coupled, bousmalis2017unsupervised, murez2018image, lee2021dranet] reduce the domain gap at the image level by synthesizing target-styled images. Several works [hoffman2018cycada, chen2019crdoco, chen2019learning] adopt both feature-level and pixel-level methods. Another direction of UDA [zou2018unsupervised, zheng2021rectifying, li2019bidirectional, ma2021coarse, zhang2021prototypical] is to take a self-supervised learning approach for dense prediction tasks such as semantic segmentation. Some works [zou2018unsupervised, zheng2021rectifying, li2019bidirectional] obtain high-confidence labels measured by the uncertainty of the target prediction and use them as pseudo ground truth. Others [ma2021coarse, zhang2021prototypical] use proxy features, extracting the centroid of the intermediate features of each class to remove uncertain regions from the pseudo label.

2.3 Multi-Target Domain Adaptation

Early studies on MTDA tackled classification tasks using adaptive learning of a common model parameter dictionary [yu2018multi], domain-invariant feature extraction [peng2019domain], or knowledge distillation [nguyen2021unsupervised]. Recently, applying MTDA to higher-level vision tasks such as semantic segmentation [isobe2021multi, saporta2021multi] has become an interesting and challenging research topic. These works employ knowledge distillation to transfer the knowledge of domain-specific teacher models to a domain-agnostic student model. For more robust adaptation, Isobe et al. [isobe2021multi] enforce weight regularization on the student network, and Saporta et al. [saporta2021multi] use a shared feature extractor that constructs a common feature space for all domains. In this work, we present a simpler and more efficient method that handles multiple domains with a unified architecture, without teacher networks or weight regularization.

Figure 4: Overview of the proposed MTDT-Net. (a) MTDT-Net consists of an encoder $E$, a style encoder $S$, a domain style transfer network $T$, and a generator $G$. Given a source image $x_S$, its label map $y_S$, and target images $x_{T_i}$, MTDT-Net produces the domain transferred image $x_{S \to T_i}$. The other reconstructed images are auxiliary outputs generated only during training. (b) $T$ consists of two TAD residual blocks (ResBlocks). A TAD module follows each convolutional layer, conditioned on the channel-wise statistics $(\mu_{T_i}, \sigma_{T_i})$ of the target domains. (c) TAD transfers the target domain by statistics modulation. (d) The multi-head discriminator $D$ predicts which domain an image is from, as well as whether the image is real or fake. Note that, for brevity, we illustrate a single-target setting, but our model handles multi-target domain adaptation.

3 A Direct Adaptation Strategy (ADAS)

In this section, we describe our direct adaptation strategy for multi-target domain adaptive semantic segmentation. We have a labeled source dataset $\mathcal{D}_S = \{(x_S, y_S)\}$ and $N$ unlabeled target datasets $\mathcal{D}_{T_i} = \{x_{T_i}\}$, $i = 1, \dots, N$, where $x$ and $y$ denote an image and its ground-truth label, respectively. The goal of our approach is to directly adapt a segmentation network to multiple target domains without training STDA models. Our strategy contains two sub-modules: a multi-target domain transfer network (MTDT-Net) and a bi-directional adaptive region selection (BARS). We describe the details of MTDT-Net and BARS in Sec. 3.1 and Sec. 3.2, respectively.

3.1 Multi-Target Domain Transfer Network (MTDT-Net)

The overall pipeline of MTDT-Net is illustrated in Fig. 4-(a). The network consists of an encoder $E$, a generator $G$, a style encoder $S$, and a domain style transfer network $T$. To build an image feature space, we adopt a typical autoencoder structure with the encoder $E$ and the generator $G$. Given the source and target images $x_S$ and $x_{T_i}$, the encoder extracts the individual features $F_S = E(x_S)$ and $F_{T_i} = E(x_{T_i})$, which are later passed through the generator to reconstruct the original input images as follows:

$\hat{x}_S = G(F_S), \quad \hat{x}_{T_i} = G(F_{T_i}).$  (1)

We extract the style tensor $s_S$ of the source image $x_S$ through the style encoder $S$, and the content tensor $c_S$ from the segmentation label $y_S$, which is available only in the source domain, through a convolutional layer $\mathrm{Conv}$ as follows:

$s_S = S(x_S), \quad c_S = \mathrm{Conv}(y_S).$  (2)

We assume that an image feature is composed of a scene structure and a detail representation, which we call the content feature $c$ and the style feature $s$, as follows:

$F_S = c_S + s_S,$  (3)

where the source image feature $F_S$ is passed through the generator $G$ to obtain the reconstructed input image $\hat{x}_S$. The synthesized images are auxiliary outputs utilized for network training. The goal of our network is to generate a domain transferred image $x_{S \to T_i}$ using the same generator as follows:

$x_{S \to T_i} = G(F_{S \to T_i}), \quad F_{S \to T_i} = c_S + s_{T_i},$  (4)

where $F_{S \to T_i}$ is the domain transferred feature, composed of the source content $c_S$ and the $i$-th target domain style feature $s_{T_i}$.

To obtain the target domain style tensors, we design a domain style transfer network $T$ that transfers the source style tensor $s_S$ to the target style tensor $s_{T_i}$ as follows:

$s_{T_i} = T(s_S, \mu_{T_i}, \sigma_{T_i}),$  (5)

where the channel-wise mean $\mu_{T_i}$ and variance $\sigma^2_{T_i}$ vectors encode the $i$-th target domain feature statistics, computed with the cumulative moving average (CMA) algorithm and Welford's online algorithm [welford1962note] described in Alg. 1. The network $T$ in Fig. 4-(b) consists of two TAD ResBlocks, each built with a series of convolutional layers, our new Target Adaptive Denormalization (TAD), and ReLU. TAD is a conditional normalization module that modulates the normalized input with a learned scale and bias, similar to SPADE [park2019semantic] and RAD [richter2021enhancing], as shown in Fig. 4-(c). We pass the target standard deviation $\sigma_{T_i}$ and the target mean $\mu_{T_i}$ through separate fully connected (FC) layers and use the outputs as scale and bias as follows:

$\mathrm{TAD}(h) = \mathrm{FC}_{\gamma}(\sigma_{T_i}) \odot \bar{h} + \mathrm{FC}_{\beta}(\mu_{T_i}),$  (6)

where $\bar{h}$ is the instance-normalized [ulyanov2016instance] input $h$ to TAD. For adversarial learning with multiple target domains, we adopt a multi-head discriminator $D$ composed of an adversarial discriminator $D_{adv}$ and a domain classifier $D_{cls}$, as shown in Fig. 4-(d).
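To make Eq. (6) concrete, here is a minimal PyTorch sketch of a TAD-style conditional normalization layer. The class name, layer sizes, and the use of two separate FC layers for scale and bias follow our reading of the text and are assumptions, not the authors' released code.

import torch
import torch.nn as nn

class TAD(nn.Module):
    """Target Adaptive Denormalization (sketch): instance-normalize the input,
    then modulate it with a scale derived from the target std and a bias
    derived from the target mean, as in Eq. (6)."""
    def __init__(self, num_channels: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # FC layers mapping channel-wise target statistics to scale and bias
        self.fc_scale = nn.Linear(num_channels, num_channels)
        self.fc_bias = nn.Linear(num_channels, num_channels)

    def forward(self, h, target_mean, target_std):
        # h: (B, C, H, W); target_mean, target_std: (C,) per-channel statistics
        h_norm = self.norm(h)
        scale = self.fc_scale(target_std).view(1, -1, 1, 1)
        bias = self.fc_bias(target_mean).view(1, -1, 1, 1)
        return scale * h_norm + bias

For example, TAD(256)(torch.randn(2, 256, 32, 32), torch.zeros(256), torch.ones(256)) modulates a feature map with the stored statistics of one target domain; conditioning on per-domain statistics rather than per-domain weights is what lets a single network serve multiple targets.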

Each group of networks, $\{E, G, S, T\}$ and $\{D\}$, is trained by minimizing the corresponding loss, $\mathcal{L}_{EGST}$ and $\mathcal{L}_{D}$, respectively:

$\mathcal{L}_{EGST} = \mathcal{L}_{recon} + \mathcal{L}_{adv}^{G} + \mathcal{L}_{cls}^{G} + \mathcal{L}_{per}, \quad \mathcal{L}_{D} = \mathcal{L}_{adv}^{D} + \mathcal{L}_{cls}^{D}.$  (7)

Reconstruction Loss We impose an L1 loss on the reconstructed images to build the image feature space:

$\mathcal{L}_{recon} = \lVert \hat{x}_S - x_S \rVert_1 + \textstyle\sum_{i=1}^{N} \lVert \hat{x}_{T_i} - x_{T_i} \rVert_1.$  (8)

Adversarial Loss We apply the patchGAN [isola2017image] discriminator to impose an adversarial loss on the domain transferred images and the corresponding target images:

$\mathcal{L}_{adv} = \textstyle\sum_{i=1}^{N} \mathbb{E}_{x_{T_i}}[\log D_{adv}(x_{T_i})] + \mathbb{E}_{x_{S \to T_i}}[\log(1 - D_{adv}(x_{S \to T_i}))].$  (9)
Input: target datasets $\{\mathcal{D}_{T_i}\}_{i=1}^{N}$, encoder $E$
Output: channel-wise statistics $(\mu_{T_i}, \sigma^2_{T_i})$ for each target domain
% 1. Initialization
1:  for $i = 1, \dots, N$ do
2:     $\mu_{T_i} \leftarrow 0$, $M_{T_i} \leftarrow 0$, $n_i \leftarrow 0$
3:  end for
% 2. Online update // $K$ is the number of update iterations
4:  for $k = 1, \dots, K$ do
5:     for $i = 1, \dots, N$ do
6:        sample $x_{T_i} \sim \mathcal{D}_{T_i}$ and extract $F_{T_i} = E(x_{T_i})$
7:        $m \leftarrow$ channel-wise mean of $F_{T_i}$
8:        $n_i \leftarrow n_i + 1$
9:        if $n_i = 1$ then
10:          $\mu_{T_i} \leftarrow m$
11:       else
12:          $\delta \leftarrow m - \mu_{T_i}$
13:          $\mu_{T_i} \leftarrow \mu_{T_i} + \delta / n_i$   // CMA update of the mean
14:          $M_{T_i} \leftarrow M_{T_i} + \delta \odot (m - \mu_{T_i})$   // Welford accumulator
15:          $\sigma^2_{T_i} \leftarrow M_{T_i} / (n_i - 1)$
16:       end if
17:    end for
18: end for
Algorithm 1 Domain feature statistics extraction
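As a concrete counterpart to Alg. 1, the following is a small PyTorch sketch of the per-channel statistics update using Welford's online algorithm, where each image contributes one channel-wise mean observation. The function name and calling convention are our assumptions, not the paper's code.

import torch

@torch.no_grad()
def update_domain_stats(feat, mean, m2, count):
    """One statistics update from an encoder feature map feat of shape (B, C, H, W).
    mean and m2 are running (C,) tensors updated in place; count is the number
    of observations seen so far. Returns the new count and current variance."""
    for sample_mean in feat.mean(dim=(2, 3)):   # one (C,) observation per image
        count += 1
        delta = sample_mean - mean
        mean += delta / count                   # CMA update of the mean
        m2 += delta * (sample_mean - mean)      # Welford accumulator
    var = m2 / max(count - 1, 1)                # unbiased variance estimate
    return count, var

Calling this once per batch for each target domain maintains the $(\mu_{T_i}, \sigma^2_{T_i})$ pairs consumed by the domain style transfer network and TAD.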

Domain Classification Loss We build the domain classifier $D_{cls}$ to classify the domain of the input images. We impose the cross-entropy loss with the target images for $\mathcal{L}_{cls}^{D}$ and with the domain transferred images for $\mathcal{L}_{cls}^{G}$:

$\mathcal{L}_{cls}^{D} = -\textstyle\sum_{i=1}^{N} \mathbb{E}[\, z_{T_i}^{\top} \log D_{cls}(x_{T_i}) \,], \quad \mathcal{L}_{cls}^{G} = -\textstyle\sum_{i=1}^{N} \mathbb{E}[\, z_{T_i}^{\top} \log D_{cls}(x_{S \to T_i}) \,],$  (10)

where $z_{T_i}$ is the one-hot encoded class label of the target domain $T_i$.
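For illustration, below is a minimal PyTorch sketch of a multi-head discriminator in the spirit of Fig. 4-(d): a shared patchGAN-style trunk with a patch-wise real/fake head $D_{adv}$ and a domain-classifier head $D_{cls}$. The channel widths and depth are illustrative assumptions.

import torch
import torch.nn as nn

class MultiHeadDiscriminator(nn.Module):
    def __init__(self, num_domains: int):
        super().__init__()
        self.trunk = nn.Sequential(  # shared patchGAN-style feature trunk
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.adv_head = nn.Conv2d(256, 1, 4, padding=1)  # patch-wise real/fake logits
        self.cls_head = nn.Sequential(                   # which target domain?
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_domains),
        )

    def forward(self, x):
        f = self.trunk(x)
        return self.adv_head(f), self.cls_head(f)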

Perceptual Loss We impose a perceptual loss [johnson2016perceptual], widely used for domain transfer as well as style transfer [dumoulin2016learned, huang2017arbitrary]:

$\mathcal{L}_{per} = \textstyle\sum_{l \in L} \lVert \phi_l(x_{S \to T_i}) - \phi_l(x_S) \rVert_1,$  (11)

where the set of layers $L$ is a subset of the layers of the perceptual network $\phi$.
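A hedged sketch of such a perceptual loss is shown below, using a VGG19 feature extractor truncated at relu4_2 as described in Sec. 4.1; the torchvision layer index and the L1 distance are our assumptions.

import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # layers up to relu4_2 (index 22 of torchvision's VGG19 "features")
        self.features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:23].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)  # the perceptual network stays frozen

    def forward(self, x, y):
        return nn.functional.l1_loss(self.features(x), self.features(y))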

Figure 5: Overview of BARS. For each class $k$, BARS extracts the centroids from the intermediate features of the segmentation network with RoI pooling and updates them with the CMA algorithm. Then, BARS measures the similarity for the two cases, “domain transferred features vs. centroids” and “target features vs. centroids”, and selects the adaptive regions. The switch m⃝ selects the labels used for the centroid update in Eq. (12): the ground-truth and pseudo labels for the first m iterations, and the filtered labels afterwards. We set m to 300 iterations in our experiments.
G → C, I
Method                  Target  flat  constr. object nature  sky  human vehicle  mIoU  Avg.
ADVENT [vu2019advent]     C     93.9  80.2    26.2   79.0   80.5  52.5  78.0     70.0  67.4
                          I     91.8  54.5    14.4   76.8   90.3  47.5  78.3     64.8
MTKT [saporta2021multi]   C     94.5  82.0    23.7   80.1   84.0  51.0  77.6     70.4  68.2
                          I     91.4  56.6    13.2   77.3   91.4  51.4  79.9     65.9
Ours                      C     95.1  82.6    39.8   84.6   81.2  63.6  80.7     75.4  71.2
                          I     90.5  63.0    22.2   73.7   87.9  54.3  76.9     66.9

G → C, M
ADVENT [vu2019advent]     C     93.1  80.5    24.0   77.9   81.0  52.5  75.0     69.1  68.9
                          M     90.0  71.3    31.1   73.0   92.6  46.6  76.6     68.7
MTKT [saporta2021multi]   C     95.0  81.6    23.6   80.1   83.6  53.7  79.8     71.1  70.9
                          M     90.6  73.3    31.0   75.3   94.5  52.2  79.8     70.8
Ours                      C     96.4  83.5    35.1   83.8   84.9  62.3  81.3     75.3  73.9
                          M     88.6  73.7    41.0   75.4   93.4  58.5  77.2     72.6

G → C, I, M
ADVENT [vu2019advent]     C     93.6  80.6    26.4   78.1   81.5  51.9  76.4     69.8  67.8
                          I     92.0  54.6    15.7   77.2   90.5  50.8  78.6     65.6
                          M     89.2  72.4    32.4   73.0   92.7  41.6  74.9     68.0
MTKT [saporta2021multi]   C     94.6  80.7    23.8   79.0   84.5  51.0  79.2     70.4  69.1
                          I     91.7  55.6    14.5   78.0   92.6  49.8  79.4     65.9
                          M     90.5  73.7    32.5   75.5   94.3  51.2  80.2     71.1
Ours                      C     95.8  82.4    38.3   82.4   85.0  60.5  80.2     74.9  71.3
                          I     89.9  52.7    25.0   78.1   92.1  51.0  77.9     66.7
                          M     89.2  71.5    45.2   75.8   92.3  56.1  75.4     72.2

Table 1: Quantitative comparison between our method and state-of-the-art methods on GTA5 (G) to Cityscapes (C), IDD (I), and Mapillary (M) with the 7-class setting. Bold: best score among all methods.

3.2 Bi-directional Adaptive Region Selection (BARS)

The key idea of BARS is to select the pixels where the feature statistics are consistent, and then to train the task network by imposing the loss only on the selected region, as illustrated in Fig. 5. We apply it to both the domain transferred image and the target image. We first extract the centroid feature $\rho_k$ of each class $k$ as follows:

$\rho_k = \frac{1}{n_k} \sum_{h,w} \mathbb{1}[\, y^{(h,w)} = k \,]\, F^{(h,w)},$  (12)

where $\mathbb{1}[\cdot]$ is an indicator function, $n_k$ is the number of pixels of class $k$, and $(h, w)$ are the indices of the spatial coordinates. The feature map $F$ is taken from the second last layer of the task network. To extract the centroids, we use the ground-truth label of the domain transferred image and the pseudo label of the target image. For online learning with the centroids, we also apply the CMA algorithm of Alg. 1 to the above centroids. We then design the selection mechanism using the following two assumptions:

  • A region of the domain transferred image whose features are far from the centroids would disturb the adaptation process.

  • A region of the target image whose features are far from the centroids is likely to be a noisy prediction region.

Based on these assumptions, we find the nearest class $k^{*}$ for each pixel in the feature map using the L2 distance between the feature at each pixel and the centroid features, as follows:

$k^{*}(h, w) = \arg\min_{k} \lVert F^{(h,w)} - \rho_k \rVert_2.$  (13)

We obtain the filtered labels $\bar{y}_{S \to T_i}$ and $\bar{y}_{T_i}$ using the nearest class $k^{*}$, keeping a pixel's label only where it agrees with $k^{*}$ and ignoring it otherwise:

$\bar{y}^{(h,w)} = y^{(h,w)}$ if $k^{*}(h,w) = y^{(h,w)}$, ignored otherwise.  (14)

Finally, we train the task network with the filtered labels using a typical cross-entropy loss $\mathcal{L}_{ce}$:

$\mathcal{L}_{task} = \mathcal{L}_{ce}(x_{S \to T_i}, \bar{y}_{S \to T_i}) + \mathcal{L}_{ce}(x_{T_i}, \bar{y}_{T_i}).$  (15)
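To make Eqs. (12)-(14) concrete, here is a minimal PyTorch sketch of the centroid extraction and region selection; the helper names and the ignore-index convention are our assumptions, not the authors' code.

import torch

IGNORE_INDEX = 255  # assumed ignore label for the cross-entropy loss

def class_centroids(feat, label, num_classes):
    """Eq. (12): mean feature over the pixels of each class.
    feat: (C, H, W) features; label: (H, W) integer labels."""
    C = feat.shape[0]
    flat = feat.permute(1, 2, 0).reshape(-1, C)          # (H*W, C)
    cents = torch.zeros(num_classes, C)
    for k in range(num_classes):
        mask = label.reshape(-1) == k
        if mask.any():
            cents[k] = flat[mask].mean(dim=0)
    return cents

def bars_filter(feat, label, centroids):
    """Eqs. (13)-(14): keep a pixel's label only if its nearest centroid agrees."""
    C, H, W = feat.shape
    flat = feat.permute(1, 2, 0).reshape(-1, C)
    nearest = torch.cdist(flat, centroids).argmin(dim=1).reshape(H, W)  # Eq. (13)
    filtered = label.clone()
    filtered[nearest != label] = IGNORE_INDEX            # Eq. (14)
    return filtered

The filtered label can then be fed to torch.nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX), so the dropped pixels contribute no gradient, which is one natural way to realize Eq. (15).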
Figure 6: Qualitative comparison between source only and our method on GTA5 (G) to Cityscapes (C), IDD (I), and Mapillary (M) with the 7-class and 19-class settings. For each setting, the columns show the input image, ground truth (GT), the source-only prediction, and ours; the rows correspond to C, I, and M.

4 Experiments

In this section, we describe the implementation details and experimental results of the proposed ADAS. We evaluate our method on a semantic segmentation task for both synthetic-to-real adaptation in Sec. 4.2 and real-to-real adaptation in Sec. 4.3, with multiple target domain datasets. We also conduct an extensive study to validate each sub-module, MTDT-Net and BARS, in Sec. 4.4.

Source → Targets   Method                 C     I     M     mIoU Avg.
G → C, I           CCL [isobe2021multi]   45.0  46.0  -     45.5
                   Ours                   45.8  46.3  -     46.1
G → C, M           CCL [isobe2021multi]   45.1  -     48.8  46.8
                   Ours                   45.8  -     49.2  47.5
G → I, M           CCL [isobe2021multi]   -     44.5  46.4  45.5
                   Ours                   -     46.1  47.6  46.9
G → C, I, M        CCL [isobe2021multi]   46.7  47.0  49.9  47.9
                   Ours                   46.9  47.7  51.1  48.6

Table 2: Results of adapting GTA5 (G) to Cityscapes (C), IDD (I), and Mapillary (M) with the 19-class setting.

4.1 Training Details

Datasets We use four different driving datasets, one synthetic and three real-world, each of which has a unique scene structure and visual appearance.

  • GTA5[richter2016playing] is a synthetic dataset of 24,966 labeled images captured from a video game.

  • Cityscapes[cordts2016cityscapes] is an urban dataset collected from European cities, and includes 2,975 images in the training set and 500 in the validation set.

  • IDD [varma2019idd] contains a total of 10,003 Indian urban driving scenes: 6,993 images for training, 981 for validation, and 2,029 for testing.

  • Mapillary Vistas [neuhold2017mapillary] is a large-scale dataset that contains city scenes from around the world, with 18,000 images for training and 2,000 for validation.

For a fair comparison with the recent MTDA methods [vu2019advent, saporta2021multi, isobe2021multi], we follow the segmentation label mapping protocols of 19 classes and super classes (7 classes) proposed in those papers. We use mIoU (%) as the evaluation metric for all domain adaptation experiments.
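For reference, below is a small NumPy sketch of the mIoU metric: per-class IoU computed from a confusion matrix and averaged over the classes that appear. This is the standard computation, not code from the paper.

import numpy as np

def mean_iou(pred, gt, num_classes, ignore=255):
    """pred, gt: integer label arrays of the same shape; returns mIoU in %."""
    mask = gt != ignore
    hist = np.bincount(num_classes * gt[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(hist)                     # true positives per class
    union = hist.sum(0) + hist.sum(1) - inter
    iou = inter / np.maximum(union, 1)        # avoid division by zero
    return 100.0 * iou[union > 0].mean()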

Figure 7: Real-to-real domain transfer results with Cityscapes (C), IDD (I), and Mapillary (M), for all source-to-targets combinations (C → I, M; I → C, M; M → C, I). Red-boxed images are the inputs.

Implementation Details We use the Deeplabv2+ResNet-101 [chen2017deeplab, he2016deep] architecture for our segmentation network, as used in other conventional works [isobe2021multi, saporta2021multi]. We use the same encoder and generator structure as DRANet [lee2021dranet], with group normalization [wu2018group]. For our multi-head discriminator, we use the patchGAN discriminator [isola2017image] and two fully connected layers as the domain classifier. We use an ImageNet-pretrained VGG19 network [simonyan2014very] as the perceptual network and compute the perceptual loss at layer relu4_2. We use a stochastic gradient descent optimizer [bottou2010large] with a momentum of 0.9 for training the segmentation network, and the Adam optimizer [kingma2014adam] with momentums of 0.9 and 0.999 for training all the networks in MTDT-Net.

4.2 Synthetic-to-Real Adaptation

We conduct experiments on synthetic-to-real adaptation with the same settings as the competing methods [saporta2021multi, isobe2021multi]. We use GTA5 as the source dataset and a combination of Cityscapes, IDD, and Mapillary as the multiple target datasets. We show qualitative results of the multi-target domain transfer in Fig. 1, which demonstrates that our single MTDT-Net can synthesize high quality images even in multi-target domain scenarios. We report the quantitative results for semantic segmentation with the 7 common classes in Tab. 1 and with 19 classes in Tab. 2. The results show that our method, composed of both MTDT-Net and BARS, outperforms state-of-the-art methods by a large margin. Unlike ADVENT [vu2019advent], which calculates its selection criterion from potentially incorrect target predictions, our BARS derives the criterion robustly from accurate source ground truth without class ambiguity. Moreover, MTDT-Net transfers the visual attributes of domains rather than only adapting color information as the color transfer algorithm [reinhard2001color] used in CCL [isobe2021multi] does. This new attribute alignment improves task performance over the state-of-the-art methods. Lastly, the qualitative results in Fig. 6 demonstrate that our method produces reliable label prediction maps under both label mapping protocols.

4.3 Real-to-Real Adaptation

To show the scalability of our model, we also conduct experiments in real-to-real adaptation scenarios. We set one of the real-world datasets (Cityscapes, IDD, or Mapillary) as the source domain and the other two as the target domains, and run all possible combinations of source and targets. The results in Fig. 7 show that MTDT-Net produces high fidelity images even across real-world datasets. As shown in Tab. 3, our method outperforms the competing methods overall. These experiments demonstrate that our method achieves realistic image synthesis not only in synthetic-to-real but also in real-to-real adaptation, which validates the scalability and reliability of our model.

Source → Targets  # of classes  Method                   C     I     M     mIoU Avg.
C → I, M          19            CCL [isobe2021multi]     -     53.6  51.4  52.5
                                Ours                     -     48.3  53.6  50.5
                  7             MTKT [saporta2021multi]  -     68.3  69.3  68.8
                                Ours                     -     70.4  75.1  72.7
I → C, M          19            CCL [isobe2021multi]     46.8  -     49.8  48.3
                                Ours                     49.1  -     50.8  50.0
                  7             MTKT [saporta2021multi]  -     -     -     -
                                Ours                     79.5  -     77.9  78.7
M → C, I          19            CCL [isobe2021multi]     58.5  54.1  -     56.3
                                Ours                     58.7  54.1  -     56.4
                  7             MTKT [saporta2021multi]  -     -     -     -
                                Ours                     75.8  81.1  -     78.5

Table 3: Results of real-to-real MTDA on all possible combinations among Cityscapes (C), IDD (I), and Mapillary (M).

4.4 Further Study on MTDT-Net and BARS

In this section, we conduct additional experiments to validate each sub-module, MTDT-Net and BARS.

MTDT-Net We compare MTDT-Net with the color transfer algorithm [reinhard2001color] used in CCL [isobe2021multi] and with DRANet [lee2021dranet], the most recent multi-domain transfer methods. We conduct the experiment on synthetic-to-real adaptation using GTA5, Cityscapes, IDD, and Mapillary, as in Sec. 4.2, and train the task network on the synthesized images from each method with the corresponding source labels. Tab. 4 shows the results for semantic segmentation with the 19-class setting. Among the competing methods, MTDT-Net performs best. We believe the other two methods hardly transfer the domain-specific attributes of each target dataset: the color transfer algorithm merely shifts the distribution of the source image to that of the target image in color space rather than aligning domain properties, and DRANet tries to cover the feature space of each domain with a single domain-specific scale parameter, which results in unstable learning on multiple complex datasets. In contrast, MTDT-Net robustly synthesizes domain transferred images by exploiting the target feature statistics, which facilitates better domain transfer.

BARS To validate the effectiveness of the two filtered labels $\bar{y}_{S \to T_i}$ and $\bar{y}_{T_i}$, we conduct a set of experiments with and without each component. In the experiments without $\bar{y}_{S \to T_i}$, we train the segmentation network on the output images of MTDT-Net using the full source labels. As reported in Tab. 5, using either $\bar{y}_{S \to T_i}$ or $\bar{y}_{T_i}$ alone already brings a large improvement. However, regions with ambiguous or noisy labels still limit model performance, so the network trained with both filtered labels achieves the best results.

Method                             C     I     M     mIoU Avg.
Color Transfer [reinhard2001color] 33.8  37.4  42.1  37.8
DRANet [lee2021dranet]             37.3  39.3  43.2  39.9
MTDT-Net                           41.4  40.6  44.1  42.0

Table 4: Comparison of MTDT-Net with competing methods on synthetic-to-real adaptation with the 19-class setting.
$\bar{y}_{S \to T_i}$  $\bar{y}_{T_i}$   C     I     M     mIoU Avg.
-                      -                 41.4  40.6  44.1  42.0
✓                      -                 43.1  44.0  46.9  44.7
-                      ✓                 45.0  44.9  47.5  45.8
✓                      ✓                 46.9  47.7  51.1  48.6

Table 5: Ablation study of BARS on synthetic-to-real adaptation with the 19-class setting.

5 Conclusion

In this paper, we present ADAS, a new approach for multi-target domain adaptation that directly adapts a single model to multiple target domains without relying on pretrained STDA models. For the direct adaptation, we introduce two key components: MTDT-Net and BARS. MTDT-Net enables a single model to directly transfer the distinctive properties of multiple target domains to the source domain through the novel TAD ResBlock. BARS helps remove the outliers in the segmentation labels of both the domain transferred images and the corresponding target images. Extensive experiments show that MTDT-Net synthesizes visually pleasing images transferred across domains, and that BARS effectively filters out inconsistent regions in the segmentation labels, which leads to robust training and boosts semantic segmentation performance. The experiments on benchmark datasets demonstrate that our method, built on MTDT-Net and BARS, outperforms the current state-of-the-art MTDA methods.

Acknowledgement This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2014-3-00123, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis), and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1C1C1013210).

References