Alleviating Semantic-level Shift: A Semi-supervised Domain Adaptation Method for Semantic Segmentation

04/02/2020, by Zhonghao Wang et al., IBM, University of Illinois at Urbana-Champaign

Learning segmentation from synthetic data and adapting to real data can significantly relieve human effort in labelling pixel-level masks. A key challenge of this task is how to alleviate the data distribution discrepancy between the source and target domains, i.e. how to reduce domain shift. The common approach to this problem is to minimize the discrepancy between feature distributions from different domains through adversarial training. However, directly aligning the feature distribution globally cannot guarantee consistency from a local view (i.e. at the semantic level), which prevents certain semantic knowledge learned on the source domain from being applied to the target domain. To tackle this issue, we propose a semi-supervised approach named Alleviating Semantic-level Shift (ASS), which can successfully promote distribution consistency from both global and local views. Specifically, leveraging a small number of labeled data from the target domain, we directly extract semantic-level feature representations from both the source and the target domains by averaging the features corresponding to the same categories, as indicated by the pixel-level masks. We then feed the produced features to the discriminator to conduct semantic-level adversarial learning, which collaborates with the adversarial learning from the global view to better alleviate the domain shift. We apply our ASS to two domain adaptation tasks, from GTA5 to Cityscapes and from SYNTHIA to Cityscapes. Extensive experiments demonstrate that: (1) ASS can significantly outperform the current unsupervised state of the art by employing a small number of annotated samples from the target domain; (2) ASS can beat the oracle model trained on the whole target dataset by over 3 points by augmenting the synthetic source data with annotated samples from the target domain, without suffering from the prevalent problem of overfitting to the source domain.

1 Introduction

Thanks to the development of deep learning techniques, major progress has been made in semantic segmentation, one of the most crucial computer vision tasks Chen et al. (2014, 2017a); Zhao et al. (2017); Chen et al. (2017b). However, the current advanced algorithms are often data hungry and require a large number of pixel-level masks to learn reliable segmentation models. This raises a problem: annotating pixel-level masks is costly in terms of both time and money. For example, Cityscapes Cordts et al. (2016), a real-footage dataset, required over 7,500 hours of human labor to annotate the semantic segmentation ground truth.

Figure 1: Domain adaptation. (a) Global adaptation. (b) Semantic-level adaptation. (c) Ideal result.

To tackle this issue, unsupervised training methods Chen et al. (2018); Sankaranarayanan et al. (2018); Tsai et al. (2018); Zhang et al. (2018); Wang et al. (2020) were proposed to alleviate the burdensome annotation work. Specifically, labeled images from other, similar datasets (the source domain) can be utilized to train a model, which is then adapted to the target domain by addressing the domain-shift issue. For semantic segmentation on the Cityscapes dataset specifically, previous works Richter et al. (2016); Ros et al. (2016) have created synthetic datasets that cost little human effort and can serve as source datasets.

While evaluating previous unsupervised or weakly-supervised methods for semantic segmentation Tsai et al. (2018); Wei et al. (2018, 2017b); Huang et al. (2018); Hou et al. (2018); Wei et al. (2017a), we found that there is still a large performance gap between these solutions and their fully-supervised counterparts. By delving into the unsupervised methods, we observe that the semantic-level features are only weakly supervised in the adaptation process and that adversarial learning is applied only to the global feature representations. However, simply aligning the feature distributions from a global view cannot guarantee consistency from a local view, as shown in Figure 1 (a), which leads to poor segmentation performance on the target domain. To address this problem, we propose a semi-supervised learning framework, the Alleviating Semantic-level Shift (ASS) model, for better promoting the distribution consistency of features from the two domains. In particular, ASS not only adapts global features between the two domains but also leverages a few labeled images from the target domain to supervise the segmentation task and the semantic-level feature adaptation task. In this way, the model can ease the inter-class confusion problem during the adaptation process (as shown in Figure 1 (b)) and ultimately alleviate the domain shift from the local view (as shown in Figure 1 (c)). As a result, our method 1) substantially outperforms the current state-of-the-art unsupervised methods by using a very small number of labeled target-domain images; and 2) addresses the prevalent problem that semi-supervised models typically overfit to the source domain Wang and Deng (2018), outperforming the oracle model trained with the whole target-domain dataset by utilizing the synthetic source dataset together with labeled images from the target domain.

2 Related works

Semantic segmentation: semantic segmentation is a challenging computer vision task. Since the surge of deep learning methods, the state of the art of this task has advanced substantially. DeepLab Chen et al. (2014, 2017a, 2017b) is a series of deep learning models that ranked at the top of the Pascal VOC Everingham et al. (2015) semantic segmentation challenge in 2017. It uses an Atrous Spatial Pyramid Pooling (ASPP) module, which combines multi-rate atrous convolutions with global pooling to enlarge the field of view on the feature map and thereby deepen the model's understanding of the global semantic context. DeepLab v2 has a concise structure, extracts image features well, and is easy to train, so we select it as the backbone network for our work.

Domain adaptation: It is very expensive to collect and annotate a dataset for a specific task. However, many related datasets for other tasks are available in today's big-data era Wang and Deng (2018). Thus, we can transfer the useful knowledge of a model trained on an off-the-shelf dataset and apply it to the target task Ganin et al. (2016). A typical structure for domain adaptation is the Generative Adversarial Network (GAN) Goodfellow et al. (2014), which trains a generative network and a discriminative network with an adversarial strategy. That is, the discriminative network is updated to distinguish which domain the input feature map comes from, while the generative network generates feature maps to fool the discriminative network. The discriminative network thereby supervises the generative network to minimize the discrepancy between the feature representations of the two domains. As a result, the model can apply the knowledge learned from the source domain to the task on the target domain.
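As an illustration of this adversarial strategy, the following minimal PyTorch sketch alternates a generator update (fooling the domain discriminator) with a discriminator update. The tiny networks, learning rates, and label conventions are placeholders for illustration, not the models used in this paper.

import torch
import torch.nn as nn

# Tiny placeholder networks: a feature generator and a domain discriminator.
G = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(64, 64, 3, padding=1))
D = nn.Sequential(nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(64, 1, 4, stride=2, padding=1))

opt_g = torch.optim.SGD(G.parameters(), lr=2.5e-4, momentum=0.9)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.9, 0.99))
bce = nn.BCEWithLogitsLoss()

def adversarial_step(x_src, x_tgt, src_label=0.0, tgt_label=1.0):
    # 1) Update the generator: make source features indistinguishable from
    #    target features so that the discriminator is fooled.
    opt_g.zero_grad()
    d_out = D(G(x_src))
    loss_adv = bce(d_out, torch.full_like(d_out, tgt_label))
    loss_adv.backward()
    opt_g.step()

    # 2) Update the discriminator: classify which domain the features are from.
    opt_d.zero_grad()
    d_src, d_tgt = D(G(x_src).detach()), D(G(x_tgt).detach())
    loss_d = bce(d_src, torch.full_like(d_src, src_label)) + \
             bce(d_tgt, torch.full_like(d_tgt, tgt_label))
    loss_d.backward()
    opt_d.step()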

Figure 2: Structure overview. The notation in the figure denotes the number of classes used for adaptation, the width and height of the input image, and the number of channels of the feature map.

3 Method: Alleviating Semantic-level Shift

3.1 Overview of the Model Structure

We randomly select a subset of images from the target domain together with their ground-truth annotations and treat them as the labeled target set; the whole set of source images and the set of unlabeled target images form the other two inputs. As shown in Figure 2, our domain adaptation structure has four modules: the feature generation module, the segmentation classification module, the global feature adaptation module and the semantic-level adaptation module. We work with the output feature maps of the generator, the ground-truth label maps, and the label maps downsampled to the same height and width as the feature maps. In the following, we also refer to the height and width of the input image, of the feature maps, and of the confidence maps output by the global discriminator, as well as the class set, the number of classes, and the channel number of the feature maps, as defined in Figure 2.

At test time, we forward the input image through the feature generation module and apply the segmentation classification module to its output to predict the semantic class of each pixel. The following sections introduce the details of each module.
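A minimal sketch of this test-time forward pass, assuming G and C stand for the feature generation and segmentation classification modules (the names and shapes are illustrative):

import torch.nn.functional as F

def predict(image, G, C):
    # Test-time forward pass: extract features with the generation module G,
    # classify them with the segmentation module C, upsample the score maps
    # to the input resolution, and take the per-pixel argmax as the prediction.
    features = G(image)                                   # B x C_feat x h x w
    scores = C(features)                                  # B x num_classes x h x w
    scores = F.interpolate(scores, size=image.shape[-2:],
                           mode='bilinear', align_corners=True)
    return scores.softmax(dim=1).argmax(dim=1)            # B x H x W labels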

3.2 Segmentation

We forward the feature maps to a convolutional layer that outputs score maps with one channel per class. Then, we use bilinear interpolation to upsample the score maps to the original input image size and apply a channel-wise softmax operation to obtain the final score maps. The segmentation loss is calculated as

(1)
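For concreteness, assuming the standard pixel-wise multi-class cross-entropy used in output-space adaptation methods such as Tsai et al. (2018) (the notation below is illustrative rather than the paper's own), this loss can be written as

\mathcal{L}_{seg}(I) = -\sum_{h,w}\sum_{c\in\mathcal{C}} Y^{(h,w,c)} \log P^{(h,w,c)},

where I is a source or labeled target image, Y its one-hot ground-truth map, and P the upsampled, softmax-normalized score maps.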

3.3 Global Feature Adaptation Module

This module adapts the features from the source domain to the target domain. We input the source-image score maps to the global discriminator to conduct adversarial training. We define the adversarial loss as:

(2)

We assign one domain label to source-domain pixels and another to target-domain pixels in the discriminator output. This loss therefore forces the generator to produce features that look globally closer to the target domain. To train the global discriminator, we forward the source and target score maps to it in sequence. Its loss is calculated as:

(3)

where the domain label takes one value if the maps come from the source domain and the other value if they come from the target domain.
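Under the same assumed binary cross-entropy formulation (symbols illustrative; D_g denotes the global discriminator and z a domain label), the two losses can be sketched as

\mathcal{L}_{adv}^{g}(I_s) = -\sum_{h,w} \log D_g(P_s)^{(h,w)}, \qquad
\mathcal{L}_{D_g}(P) = -\sum_{h,w}\Big[(1-z)\log\big(1-D_g(P)^{(h,w)}\big) + z\,\log D_g(P)^{(h,w)}\Big],

where D_g(P)^{(h,w)} is the predicted probability that pixel (h, w) comes from the target domain, and z = 0 for source score maps and z = 1 for target score maps; the adversarial loss pushes source score maps to be classified as target.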

Figure 3: Feature vector generation for the fully connected class feature adaptation. This example generates the feature vector for the class road. We first resize the ground-truth label maps to the size of the feature maps; we then align them with the feature maps and find the locations of the class's feature vectors (shown in red); finally, we crop these feature vectors and average them across the height and width to obtain the averaged feature vector of the class.

3.4 Semantic-level adaptation module

This module adapts the feature representation of each class in the source domain to the corresponding class feature representation in the target domain, alleviating the domain shift at the semantic level.

3.4.1 Fully connected semantic-level adaptation

We believe that the feature representations of a specific class at different pixels should be close to each other. Thereby, we can average these feature vectors across the height and width to represent the semantic-level feature distribution, and adapt the averaged feature vectors to minimize the distribution discrepancy between the two domains. As shown in Figure 3, the semantic-level feature vector of a class is calculated as

(4)

where the average is taken over all feature-map pixels labeled with that class. We then forward these semantic-level feature vectors to the semantic-level feature discriminator for adaptation, as shown in Figure 2. This discriminator has only two fully connected layers and outputs, after a softmax operation, a vector with twice as many channels as there are classes. The first half and the second half of the channels correspond to classes from the source domain and the target domain respectively. Therefore, the adversarial loss can be calculated as

(5)

To train the semantic-level discriminator, we let it classify each semantic-level feature vector into the correct class and domain. Its loss can be calculated as:

(6)

where the ground-truth channel lies in the first (source) half if the feature vector comes from the source domain and in the second (target) half if it comes from the target domain.
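The following PyTorch sketch illustrates the two steps described above: masked averaging of class features (Figure 3, Eqn. (4)) and a two-layer fully connected discriminator with twice as many outputs as classes. The feature dimension, class count, and module names are assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 19  # classes used for adaptation (illustrative)

def class_mean_features(feat, label_small, num_classes=NUM_CLASSES):
    # Masked averaging of Figure 3 / Eqn. (4): for each class, average the
    # feature vectors of all feature-map pixels labeled with that class.
    # feat:        B x C_feat x h x w feature maps from the generator
    # label_small: B x h x w label maps downsampled to the feature-map size
    vectors = {}
    for c in range(num_classes):
        mask = (label_small == c).unsqueeze(1).float()    # B x 1 x h x w
        if mask.sum() > 0:
            vectors[c] = (feat * mask).sum(dim=(0, 2, 3)) / mask.sum()
    return vectors  # {class_id: C_feat-dimensional averaged vector}

class FCSemanticDiscriminator(nn.Module):
    # Two fully connected layers; the 2 * num_classes outputs cover the
    # source-domain classes (first half) and target-domain classes (second half).
    def __init__(self, feat_dim=2048, num_classes=NUM_CLASSES):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, 1024)
        self.fc2 = nn.Linear(1024, 2 * num_classes)

    def forward(self, v):
        return self.fc2(F.leaky_relu(self.fc1(v), negative_slope=0.2))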

3.4.2 CNN semantic-level feature adaptation

We observe that extracting the semantic-level feature vectors is cumbersome, because we have to use the label maps to filter pixel locations and generate the vectors sequentially. Therefore, inspired by the previous design, we propose a more concise CNN semantic-level feature adaptation module. Its discriminator uses convolutional layers with a kernel size of 1×1, which is equivalent to applying the fully connected discriminator to each pixel of the feature map, as shown in Figure 2. The output has twice as many channels as there are classes after a softmax operation, where the first half and the second half correspond to the source domain and the target domain respectively. The adversarial loss can then be calculated as:

(7)

where each pixel's ground-truth class selects the corresponding channel. To train the discriminator, we use the following loss:

(8)
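A hedged reconstruction of these two losses, with K classes, ground-truth class c(h, w) at each pixel, and D_{se} denoting the CNN semantic-level discriminator (notation illustrative), is

\mathcal{L}_{adv}^{se}(F_s) = -\sum_{h,w} \log D_{se}(F_s)^{(h,w,\;c(h,w)+K)}, \qquad
\mathcal{L}_{D_{se}}(F) = -\sum_{h,w} \log D_{se}(F)^{(h,w,\;c(h,w)+zK)},

where z = 0 for source feature maps and z = 1 for target feature maps, so the adversarial term asks the generator to make source pixels look like the corresponding target-domain class.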

3.5 Adversarial learning procedure

Our ultimate goal is for the generator to achieve strong semantic segmentation by adapting features from the source domain to the target domain. Therefore, the training objective for the generator can be derived from Eqn. (1) as

(9)

where a weight parameter balances the adversarial terms. The two discriminators should be able to distinguish which domain the feature maps come from, which enables the features to be adapted in the right direction. We simply sum the two discriminator losses as the training objective for the discriminative modules.

(10)

In summary, we optimize the following min-max criterion so that our model performs better on the segmentation task by making the features extracted from the source domain more similar to those extracted from the target domain.

(11)
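Gathering the pieces above, and assuming weight parameters \lambda_g and \lambda_{se} on the two adversarial terms (the exact weights and symbols are illustrative), Eqns. (9) and (10) can be sketched as

\mathcal{L}_{G} = \mathcal{L}_{seg} + \lambda_g\,\mathcal{L}_{adv}^{g} + \lambda_{se}\,\mathcal{L}_{adv}^{se}, \qquad
\mathcal{L}_{D} = \mathcal{L}_{D_g} + \mathcal{L}_{D_{se}},

and training alternates between minimizing \mathcal{L}_{G} over the generator and segmentation head and minimizing \mathcal{L}_{D} over the two discriminators, which realizes the min-max game of Eqn. (11).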

4 Implementation

4.1 Network Architecture

Segmentation Network. We adopt a ResNet-101 model He et al. (2016) pre-trained on ImageNet Deng et al. (2009), keeping only its five convolutional blocks as the backbone network. Due to the memory limit, we do not use the multi-scale fusion strategy Yu and Koltun (2016). To generate better-quality feature maps, we follow the common practice of Chen et al. (2014); Yu and Koltun (2016); Tsai et al. (2018) and double the resolution of the feature maps of the final two blocks. To enlarge the field of view, we use dilated convolutional layers Yu and Koltun (2016) with dilation rates of 2 and 4 in these two blocks. For the classification heads, we apply two ASPP modules Chen et al. (2017a), one to the feature maps of each of the two blocks. After upsampling the outputs and applying a softmax operation, our backbone model achieves 65.91 mIoU under a fully-supervised scheme, which is similar to the performance reported in Tsai et al. (2018).

Global Discriminator. Following Tsai et al. (2018), we use 5 convolutional layers with 4×4 kernels, a stride of 2, and channel numbers of {64, 128, 256, 512, 1} respectively to form the network. We insert a leaky ReLU layer Maas et al. (2013) with a negative slope of 0.2 between adjacent convolutional layers. Due to the small batch size in the training process, we do not use batch normalization layers Ioffe and Szegedy (2015). Two independent global discriminators are used, one for each of the two classification heads.
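A sketch of this discriminator in PyTorch; the padding and the score-map input channels are assumptions not stated above:

import torch.nn as nn

def global_discriminator(num_classes=19):
    # Five 4x4 convolutions with stride 2 and channels {64, 128, 256, 512, 1},
    # with a LeakyReLU (negative slope 0.2) between adjacent layers and no
    # batch normalization. Input channels and padding are assumptions: here
    # the discriminator is fed the class score maps.
    channels = [num_classes, 64, 128, 256, 512]
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
    layers.append(nn.Conv2d(512, 1, kernel_size=4, stride=2, padding=1))
    return nn.Sequential(*layers)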

Fully Connected Semantic-level Discriminator. We use two fully connected layers with channel numbers of 1024 and twice the number of classes respectively, and we insert a leaky ReLU Maas et al. (2013) layer with a negative slope of 0.2 between them.

CNN Semantic-level Discriminator. We use two convolutional layers with a kernel size of 1×1, a stride of 1, and channel numbers of 1024 and twice the number of classes respectively. We insert a leaky ReLU Maas et al. (2013) layer with a negative slope of 0.2 between the two convolutional layers.

4.2 Network Training

We train the network in two stages. First, we train the segmentation network together with the global adaptation module until convergence. Then, we add the semantic-level adaptation module to refine the training process. In one iteration of the first stage, we forward the source images and the labeled target images through the generator and optimize the segmentation loss; then, we pass the generated score maps to the global discriminator to optimize the adversarial loss; after that, we forward the unlabeled target images through the generator; finally, we use the resulting maps along with the source ones to optimize the global discriminator. In the second stage, besides the whole first-stage procedure, when forwarding the source and labeled target images through the generator we also optimize the semantic-level adversarial loss, and at the final step we additionally use the source and labeled target features to optimize the semantic-level discriminator.

We use the PyTorch toolbox and a single NVIDIA V100 GPU with 16 GB of memory to train our network. Stochastic Gradient Descent (SGD) is used to optimize the segmentation network. We use Nesterov's method Botev et al. (2017) with a momentum of 0.9 and weight decay to accelerate convergence. Following Chen et al. (2014), we let the initial learning rate polynomially decay with a power of 0.9. We use the Adam optimizer Kingma and Ba (2014) with momentums of 0.9 and 0.99 for all the discriminator networks, and its learning rate follows the same polynomial decay rule.
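This setup can be sketched as follows; the learning rates, weight decay, and iteration count are placeholders rather than the values used here, and only the momentum/beta settings and the choice of Nesterov SGD and Adam follow the description above.

import torch

def build_optimizers(seg_net, discriminators,
                     base_lr_seg=2.5e-4, base_lr_disc=1e-4, weight_decay=5e-4):
    # Placeholder hyper-parameter values.
    seg_opt = torch.optim.SGD(seg_net.parameters(), lr=base_lr_seg,
                              momentum=0.9, weight_decay=weight_decay,
                              nesterov=True)
    disc_opts = [torch.optim.Adam(d.parameters(), lr=base_lr_disc,
                                  betas=(0.9, 0.99)) for d in discriminators]
    return seg_opt, disc_opts

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # Polynomial learning-rate decay with power 0.9 (Chen et al., 2014).
    return base_lr * (1.0 - cur_iter / float(max_iter)) ** power

def adjust_learning_rate(optimizer, base_lr, cur_iter, max_iter):
    # Apply the decayed rate to every parameter group before each iteration.
    for group in optimizer.param_groups:
        group['lr'] = poly_lr(base_lr, cur_iter, max_iter)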

GTA5 → Cityscapes
Method road sidewalk building wall fence pole light sign vegetation terrain sky person rider car truck bus train motorbike bike mIoU
Wu et al.Wu et al. (2018) 85.0 30.8 81.3 25.8 21.2 22.2 25.4 26.6 83.4 36.7 76.2 58.9 24.9 80.7 29.5 42.9 2.5 26.9 11.6 41.7
Tsai et al.Tsai et al. (2018) 86.5 36.0 79.9 23.4 23.3 23.9 35.2 14.8 83.4 33.3 75.6 58.5 27.6 73.7 32.5 35.4 3.9 30.1 28.1 42.4
Saleh et al.Saleh et al. (2018) 79.8 29.3 77.8 24.2 21.6 6.9 23.5 44.2 80.5 38.0 76.2 52.7 22.2 83.0 32.3 41.3 27.0 19.3 27.7 42.5
Hong et al.Hong et al. (2018) 89.2 49.0 70.7 13.5 10.9 38.5 29.4 33.7 77.9 37.6 65.8 75.1 32.4 77.8 39.2 45.2 0.0 25.5 35.4 44.5
oracle wholecity 96.7 75.7 88.3 46.0 41.7 42.6 47.9 62.7 88.8 53.5 90.6 69.1 49.7 91.6 71.0 73.6 45.3 52.0 65.5 65.9
ours+50city 94.3 63.0 84.5 26.8 28.0 38.4 35.5 48.7 87.1 39.2 88.8 62.2 16.3 87.6 23.2 39.2 7.2 24.4 58.1 50.1
ours+100city 96.0 71.7 85.9 27.9 27.6 42.8 44.7 55.9 87.7 46.9 89.0 66.0 36.4 88.4 28.9 21.4 11.4 38.0 63.2 54.2
ours+200city 96.1 71.9 85.8 28.4 29.8 42.5 45.0 56.2 87.4 45.0 88.7 65.8 38.2 89.6 42.2 35.9 17.1 35.8 61.6 56.0
ours+500city 96.2 72.7 87.6 35.1 31.7 46.6 46.9 62.7 88.7 49.6 90.5 69.2 42.7 91.1 52.6 60.9 9.6 43.1 65.6 60.2
ours+1000city 96.8 76.3 88.5 30.5 41.7 46.5 51.3 64.3 89.1 54.2 91.0 70.7 48.7 91.6 59.9 68.0 40.8 48.0 67.0 64.5
ours+2975city(all) 97.3 79.3 89.8 47.4 49.7 48.9 52.9 67.4 89.7 56.3 91.9 72.2 53.1 92.6 69.3 78.4 58.0 51.2 68.2 69.1
Table 1: Results of adapting GTA5 to Cityscapes. The first four rows show the performance of the current state-of-the-art unsupervised algorithms. The following row shows the performance of our segmentation network trained on the whole Cityscapes dataset. The last six rows show the performance of our models trained with different numbers of labeled Cityscapes images.
# of Cityscapes images Oracle GA GA+FCSA GA+CSA Improvements
0 - 42.37 - - -
50 39.47 49.98 50.16 50.13 +10.66
100 43.55 53.45 54.11 54.21 +10.66
200 47.12 54.37 56.35 55.95 +8.83
500 53.58 56.45 59.89 60.16 +6.58
1000 58.61 57.96 63.80 64.47 +5.86
2975 (all) 65.91 59.71 68.84 69.14 +3.23
Table 2: GTA5 → Cityscapes: performance contributions of the adaptation modules. The oracle model is trained with only the given number of labeled Cityscapes images. GA: global feature adaptation module. FCSA: fully connected semantic-level feature adaptation module. CSA: CNN semantic-level feature adaptation module. The Improvements column reports the gain of the GA+CSA model over the oracle model.
# of Cityscapes images sourcetarget sourcetarget sourcetarget
50 49.98 45.80 44.50
100 53.45 49.92 49.09
200 54.37 52.85 51.82
500 56.45 55.78 52.10
Table 3: Global adaptation direction comparison
# of Cityscapes images
100 54.11 53.87 53.68 53.96 53.92
500 59.76 59.29 59.89 59.74 59.69
(a): weight parameter analysis for the fully connected semantic-level adaptation module
# of Cityscapes images
500 59.76 59.46 60.16 59.67
(b): weight parameter analysis for the CNN semantic-level adaptation module
Table 4: Weight parameter analysis
Figure 4: (a) image; (b) ground truth; (c) oracle model trained with the whole Cityscapes dataset; (d) unsupervised; (e) ours+200city; (f) ours+1000city; (g) ours+wholecity

5 Experiments

We validate the effectiveness of our proposed method by transferring models trained on synthetic datasets (i.e., GTA5 Richter et al. (2016) and SYNTHIA Ros et al. (2016)) to the real-world Cityscapes dataset Cordts et al. (2016) in a semi-supervised setting.

5.1 Datasets

The Cityscapes Cordts et al. (2016) dataset consists of 5,000 images with high-quality pixel-level annotations. These street-scene images are annotated with 19 semantic labels for evaluation. The dataset is split into training, validation and test sets with 2,975, 500 and 1,525 images respectively. Following previous works Hoffman et al. (2016); Peng et al. (2017), we only evaluate our models on the validation set. The GTA5 Richter et al. (2016) dataset contains 24,966 finely annotated synthetic images, all of which are frames captured from the game Grand Theft Auto V. This dataset shares all 19 evaluation classes with the Cityscapes dataset. The SYNTHIA Ros et al. (2016) dataset has 9,400 images with pixel-level annotations. Similar to Chen et al. (2017c); Tsai et al. (2018), we evaluate our models on the Cityscapes validation set using the 13 classes shared between the SYNTHIA and Cityscapes datasets.

5.2 GTA5

We present the results of adapting our model from the GTA5 dataset to the Cityscapes dataset in Table 1. For fairness, we compare our results with the current state-of-the-art unsupervised algorithms that use ResNet-101 He et al. (2016) as the backbone. Our semi-supervised model trained with only a few labeled Cityscapes images (e.g., 50) beats all the unsupervised models; trained with 1000 labeled Cityscapes images, it achieves performance close to that of the oracle model; and trained with the whole labeled Cityscapes dataset, it outperforms the oracle model by 3.2 mIoU points. This result addresses the issue that semi-supervised models easily overfit to the source domain dataset Wang and Deng (2018). We provide some qualitative results in Figure 4.

Then, we conduct ablation studies on the performance contribution of each adaptation module. The results are shown in Table 2. First, the contribution of the global adaptation (GA) module disappears or becomes negative when the number of labeled Cityscapes images reaches 1000 or more. Compared to the model with only the GA module, the GA+FCSA and GA+CSA models both yield on-par improvements when trained with more than 50 labeled Cityscapes images. Due to its simpler structure, we report the improvements of the GA+CSA model over the oracle model in Table 2, and we recommend the CSA module for semantic-level adaptation.

We conjecture that the semantic-level adaptation direction should follow the global adaptation direction; otherwise performance degrades. Therefore, we experiment to find the adaptation direction in which the model with the GA module achieves the optimal result. As shown in Table 3, the model adapted from the source domain to the target domain always achieves the best performance. Therefore, we let all of our adaptation modules follow this adaptation direction.

In Table 4, we analyse how the selection of the weight parameter affects the performance of our model. Because of our two-stage training method, the first stage serves as a warm-up for the second, so both semantic-level adaptation modules are insensitive to the selection of this weight. For the best performance, we choose the weight to be 1 for one semantic-level adaptation module and 0.01 for the other.

5.3 SYNTHIA

We present the results of adapting our model from SYNTHIA to Cityscapes in Table 5. Again, our model trained with a few labeled Cityscapes images beats the unsupervised methods by a large margin; with around 1000 labeled Cityscapes images it is competitive with the oracle model; and with the whole labeled Cityscapes training set it outperforms the oracle model by 3.3 mIoU points.

We conduct ablation studies on the contributions of the adaptation modules, as shown in Table 6. The contribution of the GA module decreases as the number of supervised Cityscapes images increases. The GA+CSA model is worse than the model with only the GA module when fewer than 500 labeled Cityscapes images are available, but better when there are 500 or more. We conjecture that if the randomly selected labeled Cityscapes images cannot represent the whole Cityscapes semantic-level feature distribution, the CSA module cannot adapt the semantic-level features in an appropriate direction, which leads to inferior performance. Further research can address how to select target-domain labeled images so that the adaptation is more meaningful.

SYNTHIA → Cityscapes
Method road sidewalk building light sign vegetation sky person rider car bus motorbike bike mIoU
Wu et al.Wu et al. (2018) 81.5 33.4 72.4 8.6 10.5 71.0 68.7 51.5 18.7 75.3 22.7 12.8 28.1 42.7
Tsai et al.Tsai et al. (2018) 84.3 42.7 77.5 4.7 7.0 77.9 82.5 54.3 21.0 72.3 32.2 18.9 32.3 46.7
Hong et al.Hong et al. (2018) 85.0 25.8 73.5 19.5 21.3 67.4 69.4 68.5 25.0 76.5 41.6 17.9 29.5 47.8
oracle wholecity 96.5 77.6 91.2 49.0 62.1 91.4 90.2 70.1 47.7 91.8 74.9 50.1 66.9 73.8
ours+50city 94.1 63.9 87.6 18.1 37.1 87.5 89.7 64.6 37.0 87.4 38.6 23.2 59.6 60.7
ours+100city 93.6 64.6 88.8 30.4 43.3 89.0 89.2 65.3 25.1 88.2 47.4 23.8 59.2 62.1
ours+200city 95.2 71.2 89.1 32.9 46.4 89.1 90.3 67.0 31.6 89.4 42.3 32.9 64.8 64.8
ours+500city 96.7 77.1 91.0 42.6 62.2 91.2 91.3 69.5 34.8 91.6 56.3 35.0 68.0 69.8
ours+1000city 97.2 80.6 91.3 46.3 66.4 91.5 91.4 71.6 45.0 92.2 61.5 45.3 68.4 73.0
ours+2975city(all) 97.5 82.7 92.4 53.3 69.7 92.2 92.8 73.6 52.4 93.7 79.3 50.9 71.5 77.1
Table 5: Results of adapting SYNTHIA to Cityscapes. The first three rows show the performance of the current state-of-the-art unsupervised algorithms. The following row shows the performance of our segmentation network trained on the whole Cityscapes dataset. The last six rows show the performance of our models trained with different numbers of labeled Cityscapes images.
# of Cityscapes images Oracle GA GA+CSA Improvements
0 - 46.74 - -
50 52.62 60.65 57.37 +8.03
100 57.59 62.14 58.34 +4.55
200 60.84 64.79 64.47 +3.95
500 66.54 69.07 69.79 +3.25
1000 70.70 71.80 72.97 +2.27
2975 (all) 73.80 74.97 77.08 +3.28
Table 6: SYNTHIA → Cityscapes: performance contributions of the adaptation modules. GA: global feature adaptation module. CSA: CNN semantic-level feature adaptation module. We present the improvements of the better of the GA and GA+CSA models over the oracle model.

6 Conclusion

This paper proposes a semi-supervised learning framework that adapts both global features and semantic-level features from the source domain to the target domain for the semantic segmentation task. With only a few labeled target images, our model outperforms the current state-of-the-art unsupervised models by a large margin. By combining the synthetic source data with the fully labeled target-domain training set, our model also beats the oracle model trained on the whole target-domain dataset, without suffering from the prevalent problem of overfitting to the source domain.

References

  • [1] A. Botev, G. Lever, and D. Barber (2017) Nesterov’s accelerated gradient and momentum as approximations to regularised update descent. In IEEE IJCNN, pp. 1899–1903. Cited by: §4.2.
  • [2] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062. Cited by: §1, §2, §4.1, §4.2.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI 40 (4), pp. 834–848. Cited by: §1, §2, §4.1.
  • [4] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §1, §2.
  • [5] Y. Chen, W. Chen, Y. Chen, B. Tsai, Y. Frank Wang, and M. Sun (2017) No more discrimination: cross city adaptation of road scene segmenters. In IEEE ICCV, Cited by: §5.1.
  • [6] Y. Chen, W. Li, and L. Van Gool (2018) ROAD: reality oriented adaptation for semantic segmentation of urban scenes. In IEEE CVPR, Cited by: §1.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In IEEE CVPR, Cited by: §1, §5.1, §5.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In IEEE CVPR, Cited by: §4.1.
  • [9] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. IJCV 111 (1), pp. 98–136. External Links: ISSN 1573-1405, Document, Link Cited by: §2.
  • [10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. JMLR 17 (1), pp. 2096–2030. External Links: ISSN 1532-4435 Cited by: §2.
  • [11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative Adversarial Networks. NIPS. External Links: 1406.2661 Cited by: §2.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE CVPR, Cited by: §4.1, §5.2.
  • [13] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649. Cited by: §5.1.
  • [14] W. Hong, Z. Wang, M. Yang, and J. Yuan (2018) Conditional generative adversarial network for structured domain adaptation. In IEEE CVPR, Cited by: Table 1, Table 5.
  • [15] Q. Hou, P. Jiang, Y. Wei, and M. Cheng (2018) Self-erasing network for integral object attention. In NIPS, Cited by: §1.
  • [16] Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang (2018) Weakly-supervised semantic segmentation network with deep seeded region growing. In IEEE CVPR, Cited by: §1.
  • [17] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, External Links: Link Cited by: §4.1.
  • [18] D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. In ICLR, Cited by: §4.2.
  • [19] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML, Cited by: §4.1, §4.1, §4.1.
  • [20] X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko (2017) VisDA: The Visual Domain Adaptation Challenge. arXiv preprint arXiv:1710.06924. Cited by: §5.1.
  • [21] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In ECCV, External Links: Link, 1608.02192 Cited by: §1, §5.1, §5.
  • [22] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez (2016) The SYNTHIA Dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In IEEE CVPR, Cited by: §1, §5.1, §5.
  • [23] F. Saleh, S. Aliakbarian, M. Salzmann, L. Petersson, and J. M. Alvarez (2018) Effective use of synthetic data for urban scene semantic segmentation. In ECCV, External Links: ISBN 978-3-030-01215-1, Document Cited by: Table 1.
  • [24] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Lim, and R. Chellappa (2018) Unsupervised domain adaptation for semantic segmentation with gans. In IEEE CVPR, External Links: Link, 1711.06969 Cited by: §1.
  • [25] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In IEEE CVPR, External Links: Link, 1802.10349 Cited by: §1, §1, §4.1, §4.1, Table 1, §5.1, Table 5.
  • [26] M. Wang and W. Deng (2018) Deep visual domain adaptation: a survey. Neurocomputing 312, pp. 135–153. Cited by: §1, §2, §5.2.
  • [27] Z. Wang, M. Yu, Y. Wei, R. Feris, J. Xiong, W. Hwu, T. S. Huang, and H. Shi (2020) Differential treatment for stuff and things: a simple unsupervised domain adaptation method for semantic segmentation. arXiv preprint arXiv:2003.08040. External Links: 2003.08040 Cited by: §1.
  • [28] Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao, and S. Yan (2017) Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In IEEE CVPR, Cited by: §1.
  • [29] Y. Wei, X. Liang, Y. Chen, X. Shen, M. Cheng, J. Feng, Y. Zhao, and S. Yan (2017) STC: a simple to complex framework for weakly-supervised semantic segmentation. IEEE TPAMI 39 (11), pp. 2314–2320. External Links: Document, ISSN 0162-8828 Cited by: §1.
  • [30] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang (2018) Revisiting dilated convolution: a simple approach for weakly- and semi-supervised semantic segmentation. In IEEE CVPR, Cited by: §1.
  • [31] Z. Wu, X. Han, Y. Lin, M. G. Uzunbas, T. Goldstein, S. Lim, and L. S. Davis (2018) DCAN: dual channel-wise alignment networks for unsupervised scene adaptation. In ECCV, External Links: Link, 1804.05827 Cited by: Table 1, Table 5.
  • [32] F. Yu and V. Koltun (2016) Multi-Scale Context Aggregation by Dilated Convolutions. In ICLR, External Links: 1511.07122 Cited by: §4.1.
  • [33] Y. Zhang, Z. Qiu, T. Yao, D. Liu, and T. Mei (2018) Fully convolutional adaptation networks for semantic segmentation. In IEEE CVPR, Cited by: §1.
  • [34] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In IEEE CVPR, Cited by: §1.