All about Structure: Adapting Structural Information across Domains for Boosting Semantic Segmentation

03/26/2019 ∙ by Wei-Lun Chang, et al. ∙ National Chiao Tung University

In this paper we tackle the problem of unsupervised domain adaptation for the task of semantic segmentation, where we attempt to transfer the knowledge learned from synthetic datasets with ground-truth labels to real-world images without any annotation. With the hypothesis that the structural content of images is the most informative and decisive factor for semantic segmentation and can be readily shared across domains, we propose a Domain Invariant Structure Extraction (DISE) framework to disentangle images into domain-invariant structure and domain-specific texture representations, which further enables image translation across domains and label transfer to improve segmentation performance. Extensive experiments verify the effectiveness of our proposed DISE model and demonstrate its superiority over several state-of-the-art approaches.


1 Introduction

Semantic segmentation aims to predict pixel-level semantic labels for an image, and is considered one of the most challenging tasks in computer vision. With the renaissance of deep learning in recent years, this task has seen a great leap forward. Since the inception of the Fully Convolutional Network (FCN), which is built upon pre-trained classification models (e.g. VGG [21] and ResNet [7]) and deconvolutional layers, numerous techniques have been proposed to advance semantic segmentation, such as enlarging receptive fields [2, 27] and better preserving contextual information [28], to name a few. However, these approaches rely largely on supervised learning, thereby calling for expensive pixel-level annotations.

To circumvent this issue, one solution is to train segmentation models on synthetic data. Modern computer graphics technology can synthesize high-quality, photo-realistic images of a virtual scene, and their pixel-level semantic labels are readily available from the rendering process. It is thus possible to build a dataset for supervised semantic segmentation (e.g. GTA5 [17] and SYNTHIA [18]) from these synthetic images. Nevertheless, segmentation models trained on synthetic datasets often have difficulty achieving satisfactory performance in real-world scenes due to a phenomenon known as domain shift – i.e. synthetic and real-world images can still differ considerably in their low-level texture appearance.

Figure 1: Comparison of the conventional domain adaptation for semantic segmentation and our proposed method. Instead of making the entire feature representation domain invariant, we align only the distributions of the structure component across domains. Panels: (a) conventional domain adaptation; (b) the proposed method.

Domain adaptation is thus proposed to transfer the knowledge learned from a source domain (e.g. synthetic images) to another target domain (e.g. real images). One common approach is to learn a domain-invariant feature space across domains by matching their feature distributions, where different matching criteria have been explored, e.g. minimizing the difference in second-order statistics [23] and domain adversarial training [6, 8, 25]. A recent work [24] introduces distribution alignment directly in the structured output space for the task of semantic segmentation. However, these approaches are all driven by a strong assumption that the entire feature or output space of two domains can be well aligned (see Figure 1 (a)) to yield a domain-invariant representation that is also discriminative for the tasks in question.

In this paper, we propose a Domain Invariant Structure Extraction (DISE) framework to address unsupervised domain adaptation for semantic segmentation. We hypothesize that the high-level structure information of an image would be the most effective for its segmentation prediction. Thus, our DISE aims to discover a domain-invariant structure feature by learning to disentangle domain-invariant structure information of an image from its domain-specific texture information, as illustrated in Figure 1 (b).

Our method differs from similar prior works in (1) learning an image representation comprising explicitly a domain-invariant structure component and a domain-specific texture component, (2) making only the structure component domain invariant, and (3) allowing image-to-image translation across domains, which further enables label transfer, with all of these achieved within one single framework. Although DISE shares some parallels with domain separation networks [1] and DRIT [13], its emphasis on the separation of structure and texture information and its ability to translate images across domains while maintaining their structures highlight its novelty. Extensive experiments on standardized datasets confirm its superiority over several state-of-the-art baselines.

Figure 2: An overview of the proposed domain-invariant structure extraction (DISE) framework for semantic segmentation. The DISE framework is composed of a common encoder shared across domains, two domain-specific private encoders, a pixel-wise classifier, and a shared decoder. It encodes an image, source-domain or target-domain, into a domain-specific texture component and a domain-invariant structure component, as shown in part (a). With this disentanglement, it can translate an image in one domain to an image in the other domain by combining the structure content of the former with the texture appearance of an image from the latter domain, as shown in parts (b) and (c). This further enables the transfer of ground-truth labels from the source domain to the target domain, as illustrated in part (d).

2 Related Work

In comparison to image classification, where many prior works address the domain adaptation problem, semantic segmentation is considered a much more challenging task for domain adaptation, since its output is a segmentation map full of highly structured and contextual semantic information. We review several related works here and categorize them according to their use of three widely adopted strategies: distribution alignment, image translation, and label transfer. Different works may differ in which of these strategies they choose and in the order in which they apply them, as contrasted in Table 1.

Firstly, similar to the case of domain adaptation for image classification, different criteria may be applied to match distributions across domains in the feature space (e.g. [9, 20, 26, 30]) or in the output space. The representative work of the latter is proposed by Tsai et al. [24], where adversarial learning is applied on segmentation maps, based on spatial contextual similarities between the source and target domains. However, the assumption that the whole feature or output space of the two domains can be well aligned often proves impractical, considering the substantial difference in appearance (namely, texture) between synthetic and real-world images in some applications.

Secondly, the recent advance in image-to-image translation and style transfer [10, 12, 29] has motivated the translation of source images to gain texture appearance of target images, or vice versa. On the one hand, this translation process allows segmentation models to use translated images as augmented training data [8, 26]; on the other hand, the common feature space learned in the course of image translation can facilitate learning a domain-invariant segmentation model [20, 30].

Finally, image-to-image translation makes possible the transfer of labels from the source domain to the target domain, providing additional supervised signals to learn a model applicable to target-domain images [8, 26]. However, direct image translation may be harmful to learning, due to the risk of carrying over source-specific information to the target domain.

Our proposed DISE makes use of all three strategies but differs from these prior works in several significant ways. We hypothesize that the high-level structure information of an image would be the most informative for its semantic segmentation. Thus, the DISE is to disentangle high-level, domain-invariant structure information of an image from its low-level, domain-specific texture information through a set of common and private encoders.

Methods | IT | DA | LT | Order
Sankaranarayanan et al. [20] | ✓ | ✓ | - | IT → DA
Hong et al. [9] | - | ✓ | - | -
Wu et al. [26] | ✓ | ✓ | ✓ | IT → DA → LT
Tsai et al. [24] | - | ✓ | - | -
Chen et al. [3] | - | ✓ | - | -
Hoffman et al. [8] | ✓ | ✓ | ✓ | IT → LT, DA
Zhu et al. [30] | ✓ | ✓ | - | IT → DA
Our DISE | ✓ | ✓ | ✓ | DA → IT → LT
Table 1: Different strategies adopted by prior works on domain adaptation for semantic segmentation. IT, DA, LT stand for Image Translation, Distribution Alignment, and Label Transfer, respectively. Order denotes the order in which these strategies are applied.

3 Method

In this paper, we propose a Domain Invariant Structure Extraction (DISE) framework to address the problem of unsupervised domain adaptation for semantic segmentation. The emphasis on explicitly regularizing the common and private encoders towards capturing structure and texture information, along with the ability to translate images from one domain to another for label transfer, underlines the novelty of our method. The following gives a formal treatment of the DISE. We begin with an overview of its framework. Next, we present in detail the loss functions used, followed by a description of implementation details.

3.1 Domain Invariant Structure Extraction

The DISE aims to learn an image representation comprising a domain-invariant structure component and a domain-specific texture component. The setting assumes access to annotated source-domain images, each paired with per-pixel labels over the object categories, and unannotated target-domain images. As shown in Figure 2 (a), there are five sub-networks in DISE, namely, the common encoder shared across domains, the two domain-specific private encoders (one per domain), the shared decoder, and the pixel-wise classifier, each with its own set of parameters.

Given a source-domain image as input, the common encoder produces a structure component that characterizes its domain-invariant, high-level structure information, while the source-specific private encoder generates a texture component that captures the remaining aspects, which are largely related to domain-specific, low-level texture information. These two components are complementary to each other; when combined, they allow the decoder to minimize a reconstruction loss between the input and its reconstruction. Likewise, a target-domain image is encoded and decoded to minimize a reconstruction loss of the same kind, yielding its own structure and texture components, where the target-specific private encoder, like its source-domain counterpart, extracts the target-specific texture information. It is the structure components that are used by the classifier to predict segmentation maps in the source and target domains.

The disentanglement between structure and texture information is realized by the regularization coming from image translation with domain adversarial training [14] and perceptual loss minimization [12]. As illustrated in Figure 2 (b) and (c), we consider any pair of source- and target-domain images together with their respective structure and texture components. We first interchange their domain-specific texture components, and then decode the results into two unseen, translated images. If the common and private encoders capture the structure and texture information, respectively, as we expect, each translated image should retain the high-level structure of the image that supplied its structure component while exhibiting low-level texture appearance similar to that of the image that supplied its texture component. To this end, we train our networks by imposing domain adversarial losses [14] and perceptual losses [12] at the output of the decoder in order to ensure the domain and perceptual similarities between these translated images and their counterparts in the source or target domains. This image translation functionality of DISE further allows the transfer of ground-truth labels from the source domain to the target domain. More specifically, since a target-domain-like translated image shares the same structure component as the source-domain image from which it is translated, we take the ground-truth labels of that source-domain image as the pseudo labels for the translated image, on the grounds of our hypothesis that the segmentation prediction for an image depends solely on its structure information.

Finally, we make the structure components invariant to the domain from which they are extracted by minimizing another domain adversarial loss at the output of the classifier, as well as the negative log-likelihood of the ground-truth labels with respect to the source-domain images and their target-domain-like translations, i.e. the segmentation loss and the label transfer loss (see Figure 2 (d)).
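
The data flow described in this subsection can be sketched with toy stand-ins for the sub-networks. This is only an illustration under strong simplifications: the function names are hypothetical, the "images" are small dictionaries rather than pixel arrays, and a single private encoder is used for both domains, whereas DISE uses one private encoder per domain.

```python
# Toy stand-ins for the DISE sub-networks (hypothetical names; the real
# sub-networks are convolutional networks operating on images).

def common_encoder(image):
    # Extract the domain-invariant "structure" part of the toy image.
    return image["structure"]


def private_encoder(image):
    # Extract the domain-specific "texture" part of the toy image.
    return image["texture"]


def decoder(structure, texture):
    # Rebuild an image from a structure and a texture component.
    return {"structure": structure, "texture": texture}


def translate(content_image, style_image):
    """Combine the structure of content_image with the texture of style_image."""
    return decoder(common_encoder(content_image), private_encoder(style_image))


source = {"structure": "street layout A", "texture": "rendered"}
target = {"structure": "street layout B", "texture": "photographic"}

reconstruction = translate(source, source)    # plain reconstruction
source_to_target = translate(source, target)  # source structure, target texture
target_to_source = translate(target, source)  # target structure, source texture
```

In this toy setting, `source_to_target` keeps the layout of the source image, which is why the source labels can serve as pseudo labels for it.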

3.2 Learning

The training of the proposed DISE is to minimize a weighted combination of the aforementioned loss functions with respect to the parameters of the five sub-networks:

(1)

where the combination weights are chosen empirically to strike a balance among the model capacity, reconstruction/translation quality, and prediction accuracy. In the following, we elaborate on each of these loss functions.
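
As a minimal sketch of such a weighted combination, the snippet below sums named loss terms with per-term weights. The term names mirror the losses discussed in this section but are labels chosen here for illustration, and all numeric values are placeholders rather than the weights used in the experiments.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss terms; both arguments are dicts keyed by name."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


# Placeholder loss values and weights (assumptions, not the paper's settings).
losses = {
    "seg": 0.8,
    "seg_adv": 0.5,
    "rec": 1.2,
    "trans_str": 0.9,
    "trans_tex": 0.4,
    "trans_adv": 0.7,
    "label_transfer": 0.6,
}
weights = {name: 1.0 for name in losses}
weights["seg_adv"] = 0.1  # e.g. down-weight one term relative to the others

objective = total_loss(losses, weights)
```
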
Segmentation Loss. The segmentation loss is the typical cross-entropy loss computed against the source-domain ground truths. It trains the common encoder and the classifier, in a supervised manner, to predict segmentation maps for source-domain images.
Output Space Adversarial Loss. Inspired by Tsai et al. [24], we introduce an adversarial loss at the output of the classifier, in the hope of making the common encoder and the classifier generalize well to target-domain images. Specifically, we first train a discriminator to distinguish between the source prediction and the target prediction at the patch level [11] by minimizing a supervised domain loss (i.e. the discriminator should ideally output 1 for each patch in the source prediction and 0 for each patch in the target prediction). We then update the common encoder and the classifier to fool the discriminator by inverting its output for the target prediction from 0 to 1, that is, by minimizing

(2)

where the loss is accumulated over patch coordinates, whose ranges are the height and width of the prediction divided by 16, with the factor 16 accounting for the downsampling in the discriminator.
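
The following sketch illustrates the idea of fooling a patch-level discriminator. It assumes a binary cross-entropy form in which the encoder and classifier are rewarded when the discriminator assigns label 1 (source) to patches of a target prediction, and it averages over patches; both choices, as well as the function names, are assumptions made for illustration.

```python
import math


def patch_grid(height, width, stride=16):
    """Size of the discriminator's patch grid for a height x width prediction."""
    return height // stride, width // stride


def fool_discriminator_loss(target_patch_probs, eps=1e-12):
    """Mean of -log(p) over all patches of a target prediction.

    target_patch_probs is a 2-D list holding, for every patch, the
    discriminator's probability that the prediction comes from the
    source domain (label 1). The loss is small when the discriminator
    is fooled into outputting values close to 1.
    """
    values = [p for row in target_patch_probs for p in row]
    return -sum(math.log(p + eps) for p in values) / len(values)
```

For a 256×512 training crop, `patch_grid(256, 512)` gives a 16×32 grid of patch decisions.
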
Reconstruction Loss. The reconstruction loss is to ensure that the two domain-invariant and domain-specific components of an image representation together form a nearly complete summary of the image. To encourage the reconstruction to be perceptually similar to the input image, we follow the notion of perceptual loss [12] to define our quality metric as a weighted sum of L1 differences between feature representations extracted from a pre-trained VGG network [22]. In symbols, we have

(3)

where the feature representations are the activations of each selected layer of the pre-trained VGG network for the two images being compared, the difference in each layer is normalized by the number of activations in that layer, and a separate weight is given to the loss in each layer. As pointed out in [12], the higher layers of the VGG network tend to represent the high-level structure content of an image while the lower layers generally describe its low-level texture appearance. Equation 3 is then used to regularize the reconstruction of both source- and target-domain images by minimizing the sum of their respective perceptual losses:

(4)

where the layer weighting is set to place more weight on higher layers.
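
A minimal sketch of a perceptual metric of this kind is given below. It assumes the per-layer features have already been extracted and flattened into plain lists; the function names are hypothetical, and a real implementation would operate on VGG activation tensors.

```python
def perceptual_metric(feats_x, feats_y, layer_weights):
    """Weighted sum over layers of the mean absolute feature difference.

    feats_x, feats_y: one flat list of activations per layer.
    layer_weights: one weight per layer.
    """
    total = 0.0
    for fx, fy, weight in zip(feats_x, feats_y, layer_weights):
        if len(fx) != len(fy):
            raise ValueError("feature lists must have matching sizes")
        l1 = sum(abs(a - b) for a, b in zip(fx, fy))
        total += weight * l1 / len(fx)  # normalise by the number of activations
    return total


def reconstruction_loss(src, src_rec, tgt, tgt_rec, layer_weights):
    """Sum of the perceptual metric over source and target reconstructions."""
    return (perceptual_metric(src_rec, src, layer_weights)
            + perceptual_metric(tgt_rec, tgt, layer_weights))
```

Putting larger entries at the end of `layer_weights` (the higher layers) corresponds to the structure-oriented weighting described above.
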
Translation Structure Loss. As motivated previously in Section 3.1, an image produced by translation across domains should keep its structure unchanged. The translation structure loss as defined in Equation 5 measures the differences in high-level structure between each translated image and the image from which its structure component is derived, for both translation directions. This is achieved by choosing for the perceptual metric a weighting that again stresses the feature reconstruction losses in higher layers of the pre-trained VGG network. Our goal is to penalize translated images that differ significantly in structure from the images with which they share the same structure component, thereby pushing the common encoder to encode explicitly the structure aspect of an image.

(5)

Translation Texture Loss. The translation texture loss further requires that each translated image should closely resemble in texture the image that supplied its texture component. In doing so, the private encoders have to encode explicitly the texture aspect of an image. Inspired by AdaIN [10], we propose a weighted metric that measures, channel by channel, the difference in the mean value of the activations extracted from a pre-trained VGG network for the two images:

(6)

where each layer's term is normalized by the number of channels in that layer of the VGG network, a per-layer weight specifies the weighting given to each layer, and the mean activation is computed separately for each channel. Like the translation structure loss, the translation texture loss also involves the two types of translation:

(7)

where the weighting of the perceptual metric is now chosen to place more emphasis on early layers.
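
The channel-wise texture metric can be sketched as follows. The snippet assumes features are given per layer as a list of channels, each a flat list of activations, and it uses the absolute difference of channel means; the choice of absolute (rather than squared) difference and the function names are assumptions for illustration.

```python
def channel_means(layer_feats):
    """Mean activation of every channel in one layer."""
    return [sum(channel) / len(channel) for channel in layer_feats]


def texture_metric(feats_x, feats_y, layer_weights):
    """Weighted sum over layers of the average per-channel mean difference.

    feats_x, feats_y: per layer, a list of channels; each channel is a
    flat list of activations (spatial positions are irrelevant here).
    """
    total = 0.0
    for fx, fy, weight in zip(feats_x, feats_y, layer_weights):
        means_x, means_y = channel_means(fx), channel_means(fy)
        diff = sum(abs(a - b) for a, b in zip(means_x, means_y))
        total += weight * diff / len(means_x)  # normalise by channel count
    return total
```

Because only channel means are compared, rearranging activations within a channel leaves the metric unchanged, which is what makes it sensitive to texture statistics rather than spatial structure.
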
Translation Adversarial Loss. In addition to the aforementioned perceptual losses, we also employ adversarial losses to make the target-like and source-like translated images appear as if they were drawn from the target and source domains, respectively. To this end, we adopt LSGAN [16] and Patch Discriminator [11].
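
For reference, the least-squares adversarial objectives can be sketched as below, assuming the common 0/1 target coding (1 for real images, 0 for translated ones). The scores stand for the per-patch outputs of a patch discriminator flattened into a list; the function names are hypothetical.

```python
def mean_squared(values, target):
    """Mean squared distance of discriminator scores from a target label."""
    return sum((v - target) ** 2 for v in values) / len(values)


def lsgan_discriminator_loss(real_scores, fake_scores):
    # The discriminator pushes real patches towards 1 and fake patches towards 0.
    return 0.5 * mean_squared(real_scores, 1.0) + 0.5 * mean_squared(fake_scores, 0.0)


def lsgan_generator_loss(fake_scores):
    # The generator (encoders and decoder) pushes fake patches towards 1.
    return 0.5 * mean_squared(fake_scores, 1.0)
```

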
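Label Transfer Loss. The label transfer loss is given by a typical cross-entropy loss that trains, in a supervised manner, the common encoder and the classifier on the target-domain-like translated images, using the ground-truth labels of the corresponding source-domain images as pseudo labels.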
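
Both the segmentation loss and the label transfer loss are per-pixel cross-entropy losses; they differ only in which images and labels are fed in. A minimal sketch, assuming pixels are flattened into a list and that an ignore label of 255 marks unlabeled pixels (a common convention, assumed here), is:

```python
import math


def pixel_cross_entropy(logits, labels, ignore_label=255):
    """Average softmax cross-entropy over labeled pixels.

    logits: one list of class scores per pixel.
    labels: one integer class index per pixel.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_label:
            continue  # skip pixels without a valid label
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0
```

For the label transfer loss, `logits` would come from a translated image and `labels` from the source image it was translated from.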

3.3 Implementation

Networks.

For experiments, we use a base model, referring collectively to the common encoder and the pixel-wise classifier, similar to the segmentation network in [24], which is built on DeepLab-v2 [2] with ResNet-101 [7]. We obtain initial weights by pre-training on the PASCAL VOC [5] dataset and, at training time, reuse the pre-trained batchnorm layers. The common encoder outputs the feature maps of its last residual layer as the structure component. For the private encoders, we adopt a convolutional neural network containing 4 convolution blocks, followed by one global pooling layer and one fully-connected layer. The output of each private encoder is an 8-dimensional representation. For the shared decoder, we use three residual blocks and three deconvolution layers. The input to the decoder is a concatenation of the private code, the structure feature maps, and a flag indicating the domain of the private code.
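
One way to picture the decoder input is sketched below. It assumes that the 8-dimensional private code is tiled over the spatial grid of the feature maps and that the domain flag is encoded as a single extra channel holding 0 or 1; neither detail is specified in the text, and the function names are hypothetical.

```python
def decoder_input_channels(feature_channels, private_dim=8, flag_channels=1):
    """Channel count after concatenating features, private code, and flag."""
    return feature_channels + private_dim + flag_channels


def build_decoder_input(feature_maps, private_code, domain_flag):
    """Concatenate along the channel axis.

    feature_maps: nested lists shaped [channels][height][width].
    private_code: flat list of floats (8 values in the paper's setting).
    domain_flag: 0 or 1, indicating the domain of the private code.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])

    def constant_plane(value):
        # Tile a scalar over the whole spatial grid.
        return [[value] * width for _ in range(height)]

    code_planes = [constant_plane(v) for v in private_code]
    flag_plane = [constant_plane(float(domain_flag))]
    return feature_maps + code_planes + flag_plane
```
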

Training Details.

We implement DISE with PyTorch on a single Tesla V100 with 16 GB memory. The full training takes 88 GPU hours. Due to limited memory, at training time, we resize input images to 512×1024 and perform random cropping with a crop size of 256×512. At test time, however, the input images are of size 512×1024. For fair comparison, we follow Tsai et al. [24] and resize the output predictions from 512×1024 to 1024×2048 at evaluation time. We train our model for 250,000 iterations with a batch size of 2. We use the SGD solver for the common encoder and the classifier, the Adam solver for the decoder, and the Adam solver for the other networks, each group with its own initial learning rate. All the learning rates decrease according to the polynomial decay policy. The momentum is set to 0.9 and 0.99.
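
The polynomial decay policy mentioned above is commonly written as the base rate multiplied by (1 - iteration / max_iterations) raised to a power. The sketch below assumes a power of 0.9, a value often used with DeepLab-style training but not stated in the text, and leaves the base learning rate as a caller-supplied parameter.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomially decayed learning rate at a given iteration."""
    if not 0 <= iteration <= max_iterations:
        raise ValueError("iteration must lie in [0, max_iterations]")
    return base_lr * (1.0 - iteration / max_iterations) ** power


# Example with an illustrative (hypothetical) base rate over 250,000 iterations.
schedule = [poly_lr(1.0, step, 250_000) for step in (0, 125_000, 250_000)]
```

The rate starts at the base value, shrinks monotonically, and reaches zero at the final iteration.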

4 Experimental Results

In this section, we perform experiments on typical datasets for semantic segmentation. We compare the performance of our proposed method with several state-of-the-art baselines and conduct an ablation study to understand the effect of various combinations of loss functions on segmentation performance. The code and pre-trained models are available online (https://github.com/a514514772/DISE-Domain-Invariant-Structure-Extraction).

4.1 Datasets

For experiments, we follow the common protocol adopted by most prior works; that is, taking synthetic dataset GTA5 [17] or SYNTHIA [18] with ground-truth annotations as the source domain, and Cityscapes dataset [4] as the target domain where no annotation is available during training. At test time, the evaluation is conducted on the validation set of Cityscapes. The details of these datasets are described as follows.
Cityscapes [4] is a real-world dataset composed of street-view images captured in 50 different cities. Its data split includes 2975 training images and 500 validation images, with each having a spatial resolution of 2048×1024 and 19 semantic labels at the pixel level. Note again that no ground-truth label is used in model training.
GTA5 [17] is a synthetic dataset containing 24996 images of size 1914×1052. These images are collected from the computer game Grand Theft Auto V (GTAV) and come with pixel-level semantic labels that are fully compatible with Cityscapes [4].
SYNTHIA is another synthetic dataset composed of 9400 annotated synthetic images with a resolution of 1280×960. Like GTA5, it has annotations that are semantically compatible with Cityscapes [4]. Following the prior works [9, 20, 24, 26], we use the SYNTHIA-RAND-CITYSCAPE subset [18].

4.2 Performance Comparison

We compare the performance of our method against several baselines, including the models of [3, 9, 19, 20, 24, 26]. Of these, the works [3, 9, 24] are representative of the conventional adaptation that matches distributions of feature or output spaces across domains based on adversarial training; the works [20, 26] are typical of those that map source-domain images to the target domain at the pixel level by image translation or style transfer; and Saleh et al. [19] stands out from the others by using an object detection-based method for foreground instances. More details of these works can be found in Section 2.
GTA5 to Cityscapes. Table 2 shows that, compared to the baselines, our method achieves the state-of-the-art performance of 45.4 in mean intersection-over-union (mIoU). A breakdown analysis further reveals that it outperforms most of the baselines by a large margin in predicting the "Road", "Sidewalk", "Wall", "Fence", "Building", and "Sky" classes. These are classes that often appear concurrently in an image and tend to be spatially connected. Moreover, some of them, e.g. "Road" and "Sidewalk", exhibit highly similar texture appearance. We thus attribute the good performance of our scheme to its ability to filter out the domain-specific texture information in forming a domain-invariant structure representation for semantic segmentation.

In Figure 3, we show qualitative results comparing our method against "Source Only" (i.e. no adaptation) and "Conventional Adaptation" (i.e. without disentanglement of structure and texture). For the latter, we present results of [24]. The segmentation predictions made by our method look most similar to the ground truths. On closer examination, we see that our model can better discern the difference between "Sidewalk" and "Road" than the baselines. It also does a good job of identifying rare classes such as "Pole" and "Traffic Sign". These observations suggest that our structure-based representations are indeed more discriminative than other representations that may have encoded both structure and texture information, as with the "Conventional Adaptation".
SYNTHIA to Cityscapes. We also evaluate all models on the more challenging SYNTHIA dataset. Specifically, we follow [24] to compare results based on semantic predictions for only 16 classes. Table 3 presents quantitative results in terms of per-class IoU and mIoU. Most of the observations made for the GTA5 dataset carry over to SYNTHIA. Although the prior work [9] performs close to our model in terms of mIoU, the superiority of our method in classes like "Road", "Sidewalk", "Building", and "Sky" still remains.
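
For readers unfamiliar with the metric, per-class IoU and mIoU can be computed from a confusion matrix as sketched below. The snippet works on flattened label lists, assumes an ignore label of 255 for unlabeled pixels, and returns fractions in [0, 1], whereas the tables report percentages; the function names are chosen here for illustration.

```python
def confusion_matrix(pred, truth, num_classes, ignore_label=255):
    """Rows index the ground-truth class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, truth):
        if t == ignore_label:
            continue  # unlabeled pixels do not count
        matrix[t][p] += 1
    return matrix


def per_class_iou(matrix):
    """IoU = TP / (TP + FP + FN) for every class; NaN if a class never occurs."""
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(matrix[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(ious):
    """Average over classes that have a defined IoU."""
    valid = [v for v in ious if v == v]  # NaN is the only value unequal to itself
    return sum(valid) / len(valid)
```
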

Methods | Base Model | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | Motorbike | Bicycle | mIoU
Sankaranarayanan et al. [20] | FCN8s [15] | 88.0 | 30.5 | 78.6 | 25.2 | 23.5 | 16.7 | 23.5 | 11.3 | 78.7 | 27.2 | 71.9 | 51.3 | 19.5 | 80.4 | 19.8 | 18.3 | 0.9 | 20.8 | 18.4 | 37.1
Wu et al. [26] | FCN8s [15] | 88.5 | 37.4 | 79.3 | 24.8 | 16.5 | 21.3 | 26.3 | 17.4 | 80.8 | 30.9 | 77.6 | 50.2 | 19.2 | 77.7 | 21.6 | 27.1 | 2.7 | 14.3 | 18.1 | 38.5
Hong et al. [9] | FCN8s [15] | 89.2 | 49.0 | 70.7 | 13.5 | 10.9 | 38.5 | 29.4 | 33.7 | 77.9 | 37.6 | 65.8 | 75.1 | 32.4 | 77.8 | 39.2 | 45.2 | 0.0 | 25.5 | 35.4 | 44.5
Chen et al. [3] | PSPNet [28] | 76.3 | 36.1 | 69.6 | 28.6 | 22.4 | 28.6 | 29.3 | 14.8 | 82.3 | 35.3 | 72.9 | 54.4 | 17.8 | 78.9 | 27.7 | 30.3 | 4.0 | 24.9 | 12.6 | 39.4
Wu et al. [26] | PSPNet [28] | 85.0 | 30.8 | 81.3 | 25.8 | 21.2 | 22.2 | 25.4 | 26.6 | 83.4 | 36.7 | 76.2 | 58.9 | 24.9 | 80.7 | 29.5 | 42.9 | 2.5 | 26.9 | 11.6 | 41.7
Chen et al. [3] | Deeplab v2 [2] | 85.4 | 31.2 | 78.6 | 27.9 | 22.2 | 21.9 | 23.7 | 11.4 | 80.7 | 29.3 | 68.9 | 48.5 | 14.1 | 78.0 | 19.1 | 23.8 | 9.4 | 8.3 | 0.0 | 35.9
Tsai et al. [24] | Deeplab v2 [2] | 86.5 | 36.0 | 79.9 | 23.4 | 23.3 | 23.9 | 35.2 | 14.8 | 83.4 | 33.3 | 75.6 | 58.5 | 27.6 | 73.7 | 32.5 | 35.4 | 3.9 | 30.1 | 28.1 | 42.4
Saleh et al. [19] | Deeplab v2 [2] | 79.8 | 29.3 | 77.8 | 24.2 | 21.6 | 6.9 | 23.5 | 44.2 | 80.5 | 38.0 | 76.2 | 52.7 | 22.2 | 83.0 | 32.3 | 41.3 | 27.0 | 19.3 | 27.7 | 42.5
Ours | Deeplab v2 [2] | 91.5 | 47.5 | 82.5 | 31.3 | 25.6 | 33.0 | 33.7 | 25.8 | 82.7 | 28.8 | 82.7 | 62.4 | 30.8 | 85.2 | 27.7 | 34.5 | 6.4 | 25.2 | 24.4 | 45.4
Table 2: Comparison results on Cityscapes when adapted from GTA5 in terms of per-class IoU and mIoU over 19 classes.

Methods | Base Model | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation | Sky | Person | Rider | Car | Bus | Motorbike | Bicycle | mIoU
Sankaranarayanan et al. [20] | FCN8s [15] | 80.1 | 29.1 | 77.5 | 2.8 | 0.4 | 26.8 | 11.1 | 18.0 | 78.1 | 76.7 | 48.2 | 15.2 | 70.5 | 17.4 | 8.7 | 16.7 | 36.1
Wu et al. [26] | FCN8s [15] | 81.5 | 33.4 | 72.4 | 7.9 | 0.2 | 20.0 | 8.6 | 10.5 | 71.0 | 68.7 | 51.5 | 18.7 | 75.3 | 22.7 | 12.8 | 28.1 | 36.5
Hong et al. [9] | FCN8s [15] | 85.0 | 25.8 | 73.5 | 3.4 | 3.0 | 31.5 | 19.5 | 21.3 | 67.4 | 69.4 | 68.5 | 25.0 | 76.5 | 41.6 | 17.9 | 29.5 | 41.2
Wu et al. [26] | PSPNet [28] | 82.8 | 36.4 | 75.7 | 5.1 | 0.1 | 25.8 | 8.04 | 18.7 | 74.7 | 76.9 | 51.1 | 15.9 | 77.7 | 24.8 | 4.1 | 37.3 | 38.4
Chen et al. [3] | Deeplab v2 [2] | 77.7 | 30.0 | 77.5 | 9.6 | 0.3 | 25.8 | 10.3 | 15.6 | 77.6 | 79.8 | 44.5 | 16.6 | 67.8 | 14.5 | 7.0 | 23.8 | 36.2
Tsai et al. [24] | Deeplab v2 [2] | 84.3 | 42.7 | 77.5 | 9.3 | 0.2 | 22.9 | 4.7 | 7.0 | 77.9 | 82.5 | 54.3 | 21.0 | 72.3 | 32.2 | 18.9 | 32.3 | 40.0
Ours | Deeplab v2 [2] | 91.7 | 53.5 | 77.1 | 2.5 | 0.2 | 27.1 | 6.2 | 7.6 | 78.4 | 81.2 | 55.8 | 19.2 | 82.3 | 30.3 | 17.1 | 34.3 | 41.5
Table 3: Comparison results on Cityscapes when adapted from SYNTHIA in terms of per-class IoU and mIoU over 16 classes.

Figure 3: Segmentation results on Cityscapes when adapted from GTA5. From left to right: (a) Target Image, (b) Ground Truth, (c) Source Only, (d) Conventional Adaptation [24], and (e) DISE (ours).

4.3 Ablation Study

The following presents a study of four variants of our model by comparing their performance with four distinct training objectives:

  • Source Only: Training with the annotated GTA5 dataset [17] by minimizing the segmentation loss only, i.e. without any domain adaptation.

  • Seg-map Adaptation: Training with the annotated GTA5 dataset [17] together with domain adaptation at the output space by minimizing the segmentation loss and the output space adversarial loss. This corresponds to the method in [24], which aligns segmentation predictions across domains.

  • DISE w/o Label Transfer: Training with all loss functions except label transfer loss, i.e. the setting for seg-map adaptation plus disentanglement of structure and texture components.

  • DISE: Training with all loss functions.

Table 4 compares the performance of these settings in terms of mIoU. As expected, without any domain adaptation, "Source Only" shows the worst performance with a 39.8 mIoU. The performance improves by 2.8 with "Seg-map Adaptation", arriving at a 42.6 mIoU, when introducing domain adaptation at the output space. An even higher gain of 4.3 over "Source Only" is seen for the setting of "DISE w/o Label Transfer", which reaches a 44.1 mIoU, confirming the benefit of disentangling the structure and texture components. Finally, with additional augmented data due to label transfer, the DISE achieves the best performance with a 45.4 mIoU.

Method | A | B | C | D | mIoU
Source Only | ✓ | - | - | - | 39.8
Seg-map Adaptation | ✓ | ✓ | - | - | 42.6
DISE w/o Label Transfer | ✓ | ✓ | ✓ | - | 44.1
DISE | ✓ | ✓ | ✓ | ✓ | 45.4
A: segmentation loss
B: output space adversarial loss
C: reconstruction and translation (structure, texture, adversarial) losses
D: label transfer loss
Table 4: Ablation study results on Cityscapes when adapted from GTA5 in terms of mIoU. We present results for no adaptation (Source Only), adaptation at the output space only (Seg-map Adaptation), adaptation at the output space together with structure and texture disentanglement (DISE w/o Label Transfer), and adaptation with all losses considered (DISE).

4.4 Image-to-Image Translation

In Figure 4, we show qualitative results of image-to-image translation with DISE for two settings, S2T and T2S. With S2T (respectively, T2S), we combine the structure content of images in GTA5 (respectively, Cityscapes) in column (a) with the texture appearance of images in Cityscapes (respectively, GTA5) in columns (b) and (d) to produce translated images in columns (c) and (e), respectively. We see that DISE is very effective in translating images from one domain to another with high quality. In all cases, the translated images preserve well the structure content while producing the desired texture appearance. This also validates our use of the ground-truth labels of the source-domain images as pseudo labels for their translated images with texture appearance similar to target-domain images.

Figure 4: Sample results of translated images, with columns (a) Structure, (b) Texture, (c) Output, (d) Texture, (e) Output. S2T: the structure content of GTA5 images in (a) is combined with the texture appearance of Cityscapes images in (b) and (d) to output translated images in (c) and (e), respectively. T2S: the structure content of Cityscapes images in (a) is combined with the texture appearance of GTA5 images in (b) and (d) to output translated images in (c) and (e), respectively.

5 Conclusion

In this paper, we hypothesize that the high-level structure information of an image is most decisive for semantic segmentation and can be made invariant across domains. Based on this hypothesis, we propose a novel framework, Domain Invariant Structure Extraction (DISE), to disentangle the representation of an image into a domain-invariant structure component and a domain-specific texture component, where the former is used to advance domain adaptation for semantic segmentation. The DISE also allows transfer of ground-truth labels from the source domain to the target domain, providing additional supervision for learning a segmentation network suitable for target-domain images. Extensive experimental results on typical datasets confirm the superiority of DISE over several state-of-the-art methods, supporting our initial hypothesis.

Acknowledgements

This project is supported by MOST-108-2634-F-009-013 and MOST-108-2636-E-009-001 and we are grateful to the National Center for High-performance Computing for computer time and facilities.

References

  • [1] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
  • [2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
  • [3] Y. Chen, W. Li, and L. Van Gool. Road: Reality oriented adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [5] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 2015.
  • [6] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [8] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle consistent adversarial domain adaptation. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
  • [9] W. Hong, Z. Wang, M. Yang, and J. Yuan. Conditional generative adversarial network for structured domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [10] X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [12] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [13] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [14] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • [15] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [16] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [17] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [18] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [19] F. S. Saleh, M. S. Aliakbarian, M. Salzmann, L. Petersson, and J. M. Alvarez. Effective use of synthetic data for urban scene semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [20] S. Sankaranarayanan, Y. Balaji, A. Jain, S. N. Lim, and R. Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
  • [22] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [23] B. Sun and K. Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [24] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [25] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [26] Z. Wu, X. Han, Y.-L. Lin, M. G. Uzunbas, T. Goldstein, S. N. Lim, and L. S. Davis. DCAN: Dual channel-wise alignment networks for unsupervised scene adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [27] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [28] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [29] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [30] X. Zhu, H. Zhou, C. Yang, J. Shi, and D. Lin. Penalizing top performers: Conservative loss for semantic segmentation adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.