Plugging Self-Supervised Monocular Depth into Unsupervised Domain Adaptation for Semantic Segmentation

10/13/2021 · Adriano Cardace, et al. · University of Bologna

Although recent semantic segmentation methods have made remarkable progress, they still rely on large amounts of annotated training data, which are often infeasible to collect in the autonomous driving scenario. Previous works usually tackle this issue with Unsupervised Domain Adaptation (UDA), which entails training a network on synthetic images and applying the model to real ones while minimizing the discrepancy between the two domains. Yet, these techniques do not consider additional information that may be obtained from other tasks. Differently, we propose to exploit self-supervised monocular depth estimation to improve UDA for semantic segmentation. On one hand, we deploy depth to realize a plug-in component which can inject complementary geometric cues into any existing UDA method. We further rely on depth to generate a large and varied set of samples to Self-Train the final model. Our whole proposal allows for achieving state-of-the-art performance (58.8 mIoU) in the GTA5→CS benchmark. Code is available at https://github.com/CVLAB-Unibo/d4-dbst.


1 Introduction

Semantic segmentation is the task of classifying each pixel of an image. Nowadays, Convolutional Neural Networks achieve impressive results in this task but require huge quantities of labelled images at training time [44, 3, 34, 41]. A popular trend to address this issue is to leverage computer graphics simulations [42] or game engines [40] to automatically obtain synthetic images endowed with per-pixel semantic labels. Yet, a network trained on synthetic data only will perform poorly in real environments due to the so-called domain-shift problem. In the last few years, many Unsupervised Domain Adaptation (UDA) techniques aimed at alleviating the domain-shift problem have been proposed in the literature. These approaches try to minimize the gap between the labeled source domain (e.g. synthetic images) and the unlabeled target domain (e.g. real images) by either hallucinating input images, manipulating the learned feature space or imposing statistical constraints on the predictions [58, 8, 66, 18].

Figure 1: D4 can be plugged seamlessly into any existing method to improve UDA for Semantic Segmentation. Here we show how the introduction of D4 improves the performance of two recent methods, LTIR [22] and Stuff and Things [55].

At a more abstract level, UDA may be thought of as the process of transferring more effectively to the target domain the knowledge from a task solved in the source domain. This suggests that it may be possible to improve UDA by also transferring knowledge learned from another task to improve performance in the real domain. In fact, the existence of tightly related representations within CNNs trained for different tasks has been highlighted since the early works in the field [60], and it is nowadays standard practice to initialize CNNs deployed for a variety of diverse tasks, such as, e.g., object detection [46], semantic segmentation [4] and monocular depth estimation [14], with weights learned on ImageNet classification [11]. The notion of transferability of representations among CNNs trained to solve different visual tasks has been formalized computationally by the Taskonomy proposed in [63]. Later, [38] has shown that it is possible to train a CNN to hallucinate deep features learned to address one task into features amenable to another, related task.

Inspired by these findings, we argue that monocular depth estimation could be an excellent task from which to gather additional knowledge useful to address semantic segmentation in UDA settings. First of all, a monocular depth estimation network makes predictions based on 3D cues dealing with the appearance, shape, relative sizes and spatial relationships of the stuff and things observed in the training images. This suggests that the network predicts geometry by implicitly learning to understand the semantics of the scene. Indeed, [37, 21, 24, 15] show that a monocular depth estimation network performs better if forced to jointly learn a semantic segmentation task. We argue, though, that the correlation between geometry and semantics holds bidirectionally, such that a semantic segmentation network may obtain useful hints from depth information. This intuition is supported by [38], which shows that it is possible to learn a mapping in both directions between features learned to predict depth and per-pixel semantic labels. It is also worth observing how depth prediction networks tend to extract accurate information for regions characterized by repeatable and simple geometries, such as roads and buildings, which feature strong spatial and geometric priors (e.g. the road is typically a plane in the bottom part of the image) [13, 14, 47, 57]. Therefore, on one hand, predicting accurately the semantics of such regions from depth information alone should be possible. On the other, a semantic network capable of reasoning on the geometry of the scene should be less prone to mistakes caused by appearance variations between synthetic and real images, the key issue in UDA for semantic segmentation.

Despite the above observations, the injection of geometric cues into UDA frameworks for semantic segmentation has been largely unexplored in the literature, with the exception of a few proposals which either assume the availability of depth labels in the real domain [56], a very restrictive assumption, or can leverage depth information only in the synthetic domain, where cheap labels are available [52, 27, 6]. In this respect, we set forth an additional consideration: nowadays, effective self-supervised procedures allow for training a monocular depth estimation network without the need of ground-truth labels [14, 12, 70].

Based on the above intuitions and considerations, in this paper we propose the first approach that, thanks to self-supervision, allows for deploying depth information from both synthetic and unlabelled real images in order to inject geometric cues into UDA for semantic segmentation. Purposely, we adapt the knowledge learned to pursue depth estimation into a representation amenable to semantic segmentation by the feature transfer architecture proposed in [38]. As the geometric cues learned from monocular images yield semantic predictions that are often complementary to those attainable by current UDA methods, we realize our proposal as a depth-based add-on, dubbed D4 (Depth For), which can be plugged seamlessly into any UDA method to boost its performance, as illustrated in Fig. 1.

A recent trend in UDA for semantic segmentation is Self-Training (ST), which consists in further fine-tuning the trained network on its own predictions [72, 73, 69, 29, 33, 30]. We propose a novel Depth-Based Self-Training (DBST) approach which deploys once more the availability of depth information for real images in order to build a large and varied dataset of plausible samples to be used in the Self-Training procedure (see [19] for concurrent work that proposes a similar idea).

Our framework can improve many state-of-the-art methods by a large margin in two UDA for semantic segmentation benchmarks, where networks are trained either on GTA5 [40] or SYNTHIA VIDEO SEQUENCES [42] and tested on Cityscapes [10]. Moreover, we show that our DBST procedure enables distilling the whole framework into a single ResNet101 [16] while achieving state-of-the-art performance. Our contributions can be summarized as follows:

  • We are the first to show how to exploit self-supervised monocular depth estimation on real images to pursue semantic segmentation in a domain adaptation setting.

  • We propose a depth-based module (D4) which can be plugged into any UDA for semantic segmentation method to boost performance.

  • We introduce a new protocol (DBST) that exploits depth predictions to synthesize augmented training samples for the final self-training step often deployed in UDA for semantic segmentation pipelines.

  • We show that leveraging both D4 and DBST allows for achieving 58.8 mIoU in the popular GTA5→CS UDA benchmark, i.e., to the best of our knowledge, the new state-of-the-art.

2 Related Work

Domain Adaptation. Domain Adaptation is a promising way of solving semantic segmentation without annotations. Pioneering works [17, 58, 2, 9, 31, 67, 62, 28, 22] rely on CycleGANs [71] to convert source data into the style of target data, reducing the low-level visual appearance discrepancy between domains. Other works exploit adversarial training to enforce domain alignment [49, 50, 54, 65, 59, 36, 1, 53]. [55] extended this idea by aligning differently objects with low and high variability in terms of appearance. A few works have tried to exploit depth information to boost UDA for semantic segmentation. [52], for example, proposes a unified depth-aware UDA framework that leverages the knowledge of depth maps in the source domain to perform feature space alignment. [43] extends this idea by explicitly modelling the relation between different visual semantic classes and depth ranges. [7], instead, considers depth as a way to obtain adaptation at both the input and output level. [56] is the first work to consider depth in the target domain, although assuming supervision to be available. Conversely, we show how to deploy depth in the target domain without the availability of ground-truth depth.

Self-Training. More recently, a new line of research focuses on self-training [26], where a semantic classifier is fine-tuned directly on the target domain, using its own predictions as pseudo-labels. [72, 73, 30] cleverly set class-confidence thresholds to mask out wrong predictions. [68, 33, 69] propose to use pseudo-labels with different regularization techniques to minimize both the inter-domain and intra-domain gap. [64], instead, estimates the likelihood of pseudo-labels to perform online correction and denoising during training. Differently, [48] synthesizes new samples for the target domain by cropping objects from source images using ground-truth labels and pasting them onto target images. Inspired by this work, we propose a novel algorithm for generating new samples to perform self-training on the target domain. In contrast to [48], our strategy is applied to target images only and relies on the availability of depth maps obtained through self-supervision.

Task Adaptation. All existing approaches tackle task adaptation or domain adaptation independently. [51] was the first paper to propose a cross-task and cross-domain adaptation approach, considering two image classification problems as different tasks. UM-Adapt [25] employs a cross-task distillation module to force inter-task coherency. Differently, [38] directly exploits the relationship among tasks to reduce the need for labelled data. This is done by learning a mapping function in feature space between two networks trained independently for two separate tasks, a pretext and a target one. We leverage this intuition but, unlike [38], our approach does not require supervision to solve the pretext task in the target domain.

Figure 2: From left to right: ground truth, semantics from depth, semantics by LTIR [22]. The semantic labels predicted from depth are more accurate than those yielded by UDA methods in regularly-shaped objects (such as the wall in the top image and the sidewalk in the bottom one), whilst UDA approaches tend to perform better on small objects (see the traffic signs in both images).
Figure 3: Overview of our proposal. RGB images are first processed by two different segmentation engines to produce complementary predictions that are then combined by a weighted sum which accounts for the relative strengths of the two engines (Eq. 3). During the next step (DBST), predictions from D4-UDA are used to synthesize augmented samples by mixing portions of different images according to depth and semantics. The augmented samples are used to train a final model, so as to distill the whole pipeline into a single network.

3 Method

In Unsupervised Domain Adaptation (UDA) for semantic segmentation one wishes to solve semantic segmentation in a target domain T, though labels are available only in another domain, referred to as the source domain S. In the following we describe the two ingredients of our proposal to better tackle this problem. In Sec. 3.1 we show how to transfer information from self-supervised monocular depth to semantic segmentation and merge this knowledge with any UDA method (D4-UDA, Depth For UDA). Then, in Sec. 3.2 we introduce a Depth-Based Self-Training strategy (DBST) to further improve semantic predictions while distilling the whole framework into a single CNN.

3.1 D4 (Depth For UDA)

Semantics from Depth. The main intuition behind our work is that semantic segmentation masks obtained by exploiting depth information have peculiar properties that make them suitable to improve segmentation masks obtained with standard UDA methods. However, predicting semantics from depth is an arduous task. Indeed, we experimented with several alternatives (see Sec. 4.4 Alternative strategies to exploit depth) and found that the most effective way is a procedure similar to the one proposed in [38], which we adapt to the UDA scenario. The pipeline works as follows: train one CNN to solve a first task on S and T, train another CNN to solve a second task on S only (i.e. the only domain where ground-truth labels for the second task are available) and, finally, train a transfer function to map deep features extracted by the first CNN into deep features amenable to the second one. As the second CNN has been trained only on S, also the transfer function can be trained only on S but, interestingly, it generalizes to T. As a consequence, at inference time one can solve the second task in T based on the features transferred from the first task. We refer to [38] for further details.

Hence, if we assume the first and second task to consist in depth estimation and semantic segmentation, respectively, the idea of transferring features might be deployed in a UDA scenario since it gives the possibility to solve the second task on T without the need of ground-truth labels. However, the learning framework delineated in [38] assumes availability of ground-truth labels for the first task (depth estimation in our setting) also in T (real images). As pointed out in Sec. 1, this assumption does not comply with the standard UDA for semantic segmentation problem formulation, which requires availability of semantic labels for source images (S) alongside unlabelled target images only (T). To address this issue we propose to rely on depth proxy-labels attainable from images belonging to both S and T without the need of any ground-truth information. In particular, we propose to deploy one of the recently proposed deep neural networks, such as [14], that can be trained to perform monocular depth estimation based on a self-supervised loss requiring the availability of raw image sequences only, i.e. without ground-truth depth labels. Thus, in our method we introduce the following protocol. First, we train a self-supervised monocular depth estimation network on both S and T. Then, we use this network to generate depth proxy-labels for both domains. We point out that we use such a network as an off-the-shelf algorithm, without the aim of improving depth estimation. Finally, according to [38], we train a first CNN to predict depth from images on both domains by the previously computed depth proxy-labels, a second CNN to predict semantic labels on S, and a transfer network which allows for predicting semantic labels from depth features in T. In the following, we will refer to such predictions as "semantics from depth" because they concern semantic information extracted from features amenable to perform monocular depth estimation.
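
To make the protocol concrete, the sketch below mimics its three training stages with tiny stand-in networks; module names, losses and feature sizes are illustrative assumptions, not the exact architecture described in Sec. 4.1.

```python
# Minimal sketch of the three-stage "semantics from depth" protocol (Sec. 3.1),
# with simple encoder/decoder stand-ins; names, losses and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncDec(nn.Module):
    """Tiny stand-in for a backbone + head."""
    def __init__(self, out_ch):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(64, out_ch, 1)

    def forward(self, x):
        feat = self.encoder(x)
        return self.decoder(feat), feat

depth_net = EncDec(out_ch=1)     # task 1: monocular depth (proxy-labels from Monodepth2)
sem_net = EncDec(out_ch=19)      # task 2: semantic segmentation (labels only on S)
transfer = nn.Sequential(        # maps depth features -> semantic features
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1))

def stage1_depth_loss(rgb, proxy_depth):
    # trained on both S and T with self-supervised depth proxy-labels
    pred, _ = depth_net(rgb)
    return F.l1_loss(pred, proxy_depth)

def stage2_semantic_loss(rgb_src, labels_src):
    # trained on S only, where ground-truth semantics exist
    pred, _ = sem_net(rgb_src)
    return F.cross_entropy(pred, labels_src, ignore_index=255)

def stage3_transfer_loss(rgb_src, labels_src):
    # depth_net and sem_net are kept frozen; only `transfer` is optimized on S
    # (pass only transfer.parameters() to the optimizer), yet it generalizes to T
    with torch.no_grad():
        _, depth_feat = depth_net(rgb_src)
    sem_logits = sem_net.decoder(transfer(depth_feat))
    return F.cross_entropy(sem_logits, labels_src, ignore_index=255)

@torch.no_grad()
def semantics_from_depth(rgb_tgt):
    # inference on T: semantic logits obtained from depth features only
    _, depth_feat = depth_net(rgb_tgt)
    return sem_net.decoder(transfer(depth_feat))
```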

Combine with UDA. Fig. 2 compares semantic predictions obtained from depth by the protocol described in the previous sub-section and from a recent UDA method. The reader may observe a clear pattern: predictions from depth tend to be smoother and more accurate on objects with large and regular shapes, like road, sidewalk, wall and building. However, they often turn out imprecise in regions where depth predictions are less informative, like thin things partially overlapped with other objects or fine-grained structures in the background. As UDA methods tend to perform better on such classes (see Fig. 2), our D4 approach is designed to combine the semantic knowledge extracted from depth with that provided by any chosen UDA method in order to achieve more accurate semantic predictions. Depth information helps on large objects with regular shapes, which usually account for the majority of pixels in an image. On the contrary, UDA methods perform well in predicting semantic labels for categories that typically concern much smaller fractions of the total number of pixels in an image, like e.g. the traffic signs in Fig. 2. This orthogonality suggests that a simple yet effective way to combine the semantic knowledge drawn from depth with that provided by UDA methods consists in a weighted sum of predictions, with weights computed according to the frequency of classes in S (the domain where semantic labels are available). As the weights given to UDA predictions ($w^{uda}_c$) should be larger for rarer classes, they can be computed as:

$w^{uda}_c = \frac{1}{\ln(k + f_c)}, \quad c = 1, \dots, C$ (1)

where $C$ denotes the number of classes and $f_c$ denotes their frequencies at the pixel level, i.e. the ratio between the number of pixels labelled with class $c$ in S and the total number of labelled pixels in S. Eq. 1 is the standard formulation introduced in [34] to compute bounded weights inversely proportional to the frequency of classes. We set $k$ in Eq. 1 to 1.02, akin to [34].
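
A minimal sketch of how the class-frequency weights of Eq. 1 could be computed from source-domain label maps is given below, assuming NumPy arrays of class ids; function and variable names are illustrative.

```python
# Sketch of the class-frequency weights in Eq. 1 (ENet-style [34]):
# w_c = 1 / ln(k + f_c), with k = 1.02 and f_c the pixel-level frequency of class c in S.
import numpy as np

def uda_class_weights(label_maps, num_classes, k=1.02, ignore_index=255):
    counts = np.zeros(num_classes, dtype=np.float64)
    for lbl in label_maps:                    # lbl: HxW array of class ids
        valid = lbl != ignore_index
        counts += np.bincount(lbl[valid], minlength=num_classes)[:num_classes]
    freq = counts / counts.sum()              # f_c
    return 1.0 / np.log(k + freq)             # larger weights for rarer classes
```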

Accordingly, the weights applied to semantic predictions drawn from depth ($w^{d}_c$) are given by:

(2)

Thus, at each pixel of a given image we propose to combine semantics from depth and the predictions yielded by any chosen UDA method as follows:

$p = w^{d} \odot \sigma_\tau(l^{d}) + w^{uda} \odot \sigma_\tau(l^{uda})$ (3)

where $p$ is the final prediction, $l^{d}$ and $l^{uda}$ are the logits associated with semantics from depth and the selected UDA method, respectively, $\odot$ denotes the class-wise product, and $\sigma_\tau$ denotes the softmax function with a temperature term $\tau$ that we set to 6 in our experiments.

The formulation presented in Eq. 3, i.e. the merge operation illustrated in Fig. 3, can be used seamlessly to plug semantics obtained from self-supervised monocular depth into any existing UDA method. We will refer to the combination of a given UDA method with our D4 with the expression D4-UDA. Experimental results (Sec. 4.3) show that all recent state-of-the-art UDA methods benefit significantly from the complementary geometric cues brought in by D4.
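
Below is a minimal sketch of the merge of Eq. 3, assuming the per-class weight vectors of Eqs. 1-2 and the logits of the two segmentation engines are already available; tensor shapes and names are assumptions.

```python
# Sketch of the merge in Eq. 3: a per-class weighted sum of temperature-softmaxed
# logits from the "semantics from depth" branch and the chosen UDA method.
import torch
import torch.nn.functional as F

def d4_merge(logits_depth, logits_uda, w_depth, w_uda, tau=6.0):
    # logits_*: (N, C, H, W) tensors; w_*: (C,) per-class weight tensors
    p_depth = F.softmax(logits_depth / tau, dim=1)
    p_uda = F.softmax(logits_uda / tau, dim=1)
    merged = w_depth.view(1, -1, 1, 1) * p_depth + w_uda.view(1, -1, 1, 1) * p_uda
    return merged.argmax(dim=1)   # final per-pixel semantic prediction
```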

Figure 4: The rightmost column is synthesized by copying pixels from the left column into the central one. Pixels are chosen according to their semantic class (second row) and stacked according to their depths (third row). The white pixels in the depth maps represent areas too far from the camera that cannot be selected.

3.2 DBST (Depth-Based Self-Training)

We describe here our proposal to further improve semantic predictions and distill the knowledge of the entire system into a single network easily deployable at inference time. First, we predict semantic labels for every image in T by our whole framework (i.e. D4 alongside a selected UDA method, referred to as D4-UDA); then, we use these labels to train a new model on T. This procedure, also known as self-training [26], has become popular in the recent UDA for semantic segmentation literature [72, 73, 69, 29, 33, 30] and consists in training a model on its own predictions, referred to as pseudo-labels, sometimes through multiple iterations. In contrast, we perform only one iteration, and the novelty of our approach lies in its peculiar ability to leverage the depth information available for the images in T to generate plausible new samples.

Running D4-UDA on T yields semantic pseudo-labels for every image in T. Yet, as described in Sec. 3.1 (Semantics from Depth), each image in T is also endowed with a depth prediction, provided by a self-supervised monocular depth estimation network. We can take advantage of this information to formulate a novel depth-aware data augmentation strategy whereby portions of images and the corresponding pseudo-labels are copied onto others so as to synthesize samples for the self-training procedure. The crucial difference between similar approaches presented in [32, 48] and ours consists in the deployment of depth information to steer the data augmentation procedure towards more plausible samples. Indeed, a first intuition behind our method is that semantic predictions are less accurate for objects distant from the camera: as such predictions play the role of labels in self-training, we prefer to pick closer rather than distant objects in order to generate training samples. Moreover, we reckon certain kinds of objects, like persons, vehicles and traffic signs, to be more plausibly transferable across different images as they tend to be small and less bound to specific spatial locations. On the contrary, it is quite unlikely that a piece of road or building from a given image could be merged seamlessly into a different one.

Given $K$ randomly selected images $\{x_1, \dots, x_K\}$ from T, with $K \geq 2$, paired with semantic pseudo-labels $\{\hat{y}_1, \dots, \hat{y}_K\}$ and depth predictions $\{d_1, \dots, d_K\}$, we augment $x_1$ by copying on it pixels from the set $\{x_2, \dots, x_K\}$. For each pixel of the augmented image we have $K$ possible candidates, one from $x_1$ itself and $K-1$ from the images in $\{x_2, \dots, x_K\}$. We filter such candidates according to two criteria: the predicted depth should be lower than a threshold $t_k$ and the semantic prediction should belong to a predefined set of classes, $\mathcal{C}$. Hence, we define the set of depths of the filtered candidates at each spatial location $(i,j)$ as:

$D(i,j) = \{\, d_k(i,j) \mid d_k(i,j) \leq t_k, \ \hat{y}_k(i,j) \in \mathcal{C}, \ k \in \{1, \dots, K\} \,\}$ (4)

In our experiments, for each image $x_k$ the depth threshold $t_k$ is set to a fixed percentile of the depth distribution of $x_k$, so as to avoid selecting pixels from the farthest objects in the scene. $\mathcal{C}$ contains all things classes (e.g. person, car, traffic light, etc.), which include foreground elements that can be copied onto other images without altering the plausibility of the scene, while excluding all the stuff classes, which include background elements that cannot be easily moved across scenes. This categorization is similar to the one proposed in [55] and we consider it easy to replicate in other datasets.

Then, we synthesize a new image $\tilde{x}$ and the corresponding pseudo-labels $\tilde{y}$ by assigning to each spatial location the candidate with the lowest depth, so that objects from different images will overlap plausibly in the synthesized one:

$\tilde{x}(i,j) = x_{k^*}(i,j)$ (5)
$\tilde{y}(i,j) = \hat{y}_{k^*}(i,j)$ (6)

with $k^* = \arg\min_{k : d_k(i,j) \in D(i,j)} d_k(i,j)$; wherever $D(i,j)$ is empty, the pixel and pseudo-label of $x_1$ are kept unchanged.

In Fig. 4 we depict our depth-based procedure to synthesize new training samples, considering, for the sake of simplicity, the case where $K$ is 2.
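
A minimal sketch of the synthesis of Eqs. 4-6 follows, under the assumptions stated above (per-image percentile threshold, things-only filtering, closest surviving candidate wins); the percentile value and the set of things-class ids are placeholders, not the paper's exact settings.

```python
# Sketch of DBST sample synthesis (Eqs. 4-6): candidates are kept only if closer
# than a per-image percentile threshold and labelled with a "things" class, and
# the closest surviving candidate wins at each pixel; the base image is kept elsewhere.
import numpy as np

# Placeholder things-class ids (e.g. traffic light/sign, person, rider, car, truck,
# bus, train, motorcycle, bicycle in Cityscapes train ids).
THINGS = {6, 7, 11, 12, 13, 14, 15, 16, 17, 18}

def dbst_mix(images, pseudo_labels, depths, percentile=75):
    """images[0] is the base sample x_1; images[1:] are the donors x_2..x_K."""
    base_img = images[0].copy()
    base_lbl = pseudo_labels[0].copy()
    best_depth = np.full(base_lbl.shape, np.inf)      # depth of current winner per pixel
    for img, lbl, dep in zip(images, pseudo_labels, depths):
        thr = np.percentile(dep, percentile)          # per-image threshold t_k
        valid = (dep <= thr) & np.isin(lbl, list(THINGS))
        take = valid & (dep < best_depth)             # closest filtered candidate wins
        base_img[take] = img[take]
        base_lbl[take] = lbl[take]
        best_depth[take] = dep[take]
    return base_img, base_lbl
```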

Hence, with the procedure detailed above, we synthesize an augmented version of T, which is used to distill the whole D4-UDA framework into a single model by a self-training process. This dataset is much larger and exhibits more variability than the original T. Due to its reliance on depth information, we dub our novel technique DBST (Depth-Based Self-Training). The results reported in Sec. 4.3 prove its remarkable effectiveness, both when used as the final stage following D4 and when deployed as a standalone self-training procedure applied to any other UDA method.

Method Road Sidewalk Building Walls Fence Pole T-light T-sign Vegetation Terrain Sky Person Rider Car Truck Bus Train Motorbike Bicycle mIoU Acc
AdaptSegNet [49] 86.5 36.0 79.9 23.4 23.3 23.9 35.2 14.8 83.4 33.3 75.6 58.5 27.6 73.6 32.5 35.4 3.9 30.1 28.1 42.4 85.6
D4-AdaptSegNet + DBST 93.1 53.0 85.1 42.8 27.3 35.8 43.9 18.5 85.9 39.0 89.9 63.0 31.6 86.6 39.8 36.7 0 42.4 35.0 50.0 90.3
MaxSquare [5] 88.1 27.7 80.8 28.7 19.8 24.9 34.0 17.8 83.6 34.7 76.0 58.6 28.6 84.1 37.8 43.1 7.2 32.2 34.5 44.3 86.9
D4-MaxSquare + DBST 92.9 51.2 84.7 43.5 22.2 35.7 42.5 20.0 86.2 42.0 90.0 63.7 33.0 86.9 45.5 50.9 0 42.2 41.4 51.3 90.3
BDL [28] 88.2 44.7 84.2 34.6 27.6 30.2 36.0 36.0 85.0 43.6 83.0 58.6 31.6 83.3 35.3 49.7 3.3 28.8 35.6 48.5 89.2
D4-BDL + DBST 93.2 52.6 86.4 44.1 31.2 36.5 42.4 36.1 86.3 41.0 89.8 63.3 37.4 86.3 42.8 57.8 0 40.3 37.9 52.9 90.7
MRNET [68] 90.5 35.0 84.6 34.3 24.0 36.8 44.1 42.7 84.5 33.6 82.5 63.1 34.4 85.8 32.9 38.2 2.0 27.1 41.8 48.3 88.3
D4-MRNET + DBST 93.2 51.6 86.1 45.9 24.5 37.9 47.4 40.4 85.3 37.5 89.6 64.7 39.8 85.8 41.1 53.2 8.9 17.1 33.4 51.7 90.0
Stuff and things* [55] 90.2 43.5 84.6 37.0 32.0 34.0 39.3 37.2 84.0 43.1 86.1 61.1 29.9 81.6 32.3 38.3 3.2 30.2 31.9 48.3 88.8
D4-Stuff and things + DBST 93.3 54.0 86.5 46.4 32.3 37.7 45.2 39.5 85.5 39.4 90.0 63.7 32.8 85.5 32.0 39.5 0 37.7 35.5 51.4 90.5
FADA [54] 92.5 47.5 85.1 37.6 32.8 33.4 33.8 18.4 85.3 37.7 83.5 63.2 39.7 87.5 32.9 47.8 1.6 34.9 39.5 49.2 88.9
D4-FADA + DBST 93.9 58.2 86.4 45.9 29.6 36.9 44.6 27.0 86.3 39.4 90.0 64.9 41.0 85.8 34.6 51.2 9.9 24.2 37.3 52.0 90.7
LTIR [22] 92.9 55.0 85.3 34.2 31.1 34.4 40.8 34.0 85.2 40.1 87.1 61.1 31.1 82.5 32.3 42.9 3 36.4 46.1 50.2 90.0
D4-LTIR + DBST 94.2 59.6 86.9 43.9 35.3 36.9 45.7 36.1 86.2 40.6 90.0 65.9 38.2 84.4 33.3 52.4 13.7 46.2 51.7 54.1 91.0
ProDA [64] 87.8 56.0 79.7 46.3 44.8 45.6 53.5 53.5 88.6 45.2 82.1 70.7 39.2 88.8 45.5 59.4 1.0 48.9 56.4 57.5 89.1
D4-ProDA + DBST 94.3 60.0 87.9 50.5 43.0 42.6 50.8 51.3 88.0 45.9 89.7 68.9 41.8 88.0 45.8 63.8 0 50.0 55.8 58.8 92.1
Table 1: Results on GTA5→CS. When available, checkpoints provided by the authors are used. * denotes a method retrained by us.
Method Sky Building Road Sidewalk Fence Vegetation Pole Car T-Sign Person Bicycle T-Light mIoU Acc
AdaptSegNet* [49] 75.6 78.0 89.7 28.5 3.4 76.0 28.5 85.1 27.2 55.3 46.6 0 49.5 86.9
D4-AdaptSegNet + DBST 87.7 80.1 94.0 61.8 66.0 81.1 32.2 85.4 31.3 59.0 52.3 0 55.9 90.2
MaxSquare* [5] 72.4 79.2 89.2 36.0 4.6 75.7 31.5 84.9 30.7 55.8 45.8 8.6 51.2 87.3
D4-MaxSquare + DBST 87.5 80.0 93.7 61.8 7.3 80.8 33.2 84.6 35.1 58.1 48.1 8.2 56.5 90.1
MRNET* [68] 84.6 79.7 93.9 56.3 0 80.5 35.4 88.9 27.2 59.4 56.3 0 54.5 90.0
D4-MRNET + DBST 88.3 79.9 93.9 63.0 6.3 81.3 35.5 84.3 31.3 59.5 47.9 0 55.9 90.2
Table 2: Results on SYNSEQ→CS. * denotes a method retrained by us.

4 Experiments

4.1 Implementation Details

Network Architectures. We use Monodepth2 [14] to generate depth proxy-labels for the procedure described in Sec. 3.1. We adapt the general framework presented in [38] to our setting by deploying the popular Deeplab-v2 [3] for both the depth estimation and the semantic segmentation network. Both networks consist of a backbone and an ASPP module [3], which substitute, respectively, the encoder and decoder used in [38]. The backbone is implemented as a dilated ResNet50 [61]. We also remove the downsampling and upsampling operations used in [38] when learning the transfer function between depth and semantics. More precisely, in our architecture the transfer function is realized as a simple 6-layer CNN with Batch Norm [20]. Following the recent trend in UDA for semantic segmentation [49, 5, 28, 68, 55, 54, 22], during DBST we train a single Deeplab-v2 [3] model with a dilated ResNet101 backbone pre-trained on ImageNet [11].
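
For concreteness, one possible instantiation of the transfer function described above is sketched below; the 3×3 kernel size and the channel width are assumptions, as they are not specified here.

```python
# Plausible sketch of the feature-transfer module: a plain 6-layer CNN with Batch Norm
# mapping depth features to semantic features. Kernel size and channels are assumed.
import torch.nn as nn

def make_transfer_net(channels=2048):
    layers = []
    for _ in range(6):
        layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                   nn.BatchNorm2d(channels),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```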

Training Details.

Our pipeline is implemented in PyTorch [35] and trained on one NVIDIA Tesla V100 GPU with 16GB of memory. In every training and test phase we resize input images to 1024×512, with the exception of DBST, where we first perform random scaling and then random cropping with size 1024×512. During DBST we also use color jitter to avoid overfitting on the pseudo-labels. In our version of [38], the depth and transfer networks are optimized by Adam [23] with batch size 2 for 70 and 40 epochs, respectively, while the semantic segmentation network is trained by SGD with batch size 2 for 70 epochs. The final model obtained by DBST is trained again with SGD, batch size 3, for 30 epochs. We adopt the One Cycle learning rate policy [45] in every training phase, with a different maximum learning rate for DBST than for the other trainings.
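
The optimization setup described above could be wired as in the following sketch; momentum, weight decay and the learning-rate values are placeholders rather than the exact figures used here.

```python
# Sketch of the optimizer/scheduler setup: Adam for the depth and transfer networks,
# SGD for the segmentation and DBST models, One Cycle schedule everywhere.
import torch

def make_optim_and_sched(model, steps_per_epoch, epochs, max_lr=1e-4, sgd=False):
    if sgd:   # semantic segmentation network and final DBST model
        opt = torch.optim.SGD(model.parameters(), lr=max_lr,
                              momentum=0.9, weight_decay=5e-4)
    else:     # depth network and transfer network
        opt = torch.optim.Adam(model.parameters(), lr=max_lr)
    sched = torch.optim.lr_scheduler.OneCycleLR(
        opt, max_lr=max_lr, steps_per_epoch=steps_per_epoch, epochs=epochs)
    return opt, sched
```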

4.2 Datasets

We briefly describe the datasets adopted in our experiments, pointing to the Suppl. Mat. for additional details. We follow common practice [49, 22, 28] and test our framework in the synthetic-to-real case using GTA5 [39, 40] or SYNTHIA [42] as synthetic datasets. The former consists of synthetic images captured in the game Grand Theft Auto V, while the latter is composed of images generated by rendering a virtual city. Since our method requires video sequences to train Monodepth2 [14], we use the SYNTHIA VIDEO SEQUENCES (SYNTHIA-SEQ) split in the experiments involving the SYNTHIA dataset. As for real images, we leverage the popular Cityscapes dataset [10], which consists of a large collection of video sequences of driving scenes from 50 different cities in Germany.

Figure 5: From left to right: RGB image, prediction from UDA method, prediction from D4-UDA + DBST, GT. The top two rows deal with GTA5→CS, the other two with SYNSEQ→CS. Selected methods are, from top to bottom: LTIR [22], BDL [28], MaxSquare [5] and MRNET [68]. In all these examples our proposal dramatically improves the output of the given stand-alone method, especially on classes featuring large and regular shapes, like road in rows 1-3, sidewalk in rows 2-4 and wall in row 2.

4.3 Results

We report here the experimental results obtained on two domain adaptation benchmarks, which show how the combination with our D4 method boosts the performance of recent UDA for semantic segmentation approaches.

GTA5→CS. Tab. 1 reports results on the most popular UDA benchmark for semantic segmentation, i.e. GTA5→CS, where methods are trained on GTA5 and tested on Cityscapes. We selected the most relevant UDA approaches proposed in the last years [49, 5, 28, 68, 55, 54, 22, 64], using checkpoints provided by the authors when available. We report per-class and overall results in terms of mean intersection over union (mIoU) and pixel accuracy (Acc), when each method is either used stand-alone or deployed within our proposal (i.e. D4 + DBST). The reader may notice how every UDA method improves considerably when combined with our proposal, despite the variability of their stand-alone performance. Indeed, AdaptSegNet [49], which yields about 42 mIoU, reaches 50 when embedded into our framework. Likewise, ProDA, currently considered the state-of-the-art UDA method, improves from 57.5 to 58.8 mIoU. Moreover, we can observe in Tab. 1 that our method produces a general improvement for all classes, although we experience a certain performance variability for some of them (such as train, motorbike and bicycle), probably due to noisy pseudo-labels used during DBST. Conversely, our method yields a consistently significant gain on classes characterized by large and regular shapes, namely road, sidewalk, building, wall and sky. This validates the effectiveness of a) the geometric cues derivable from depth to predict the semantics of these kinds of objects and b) the methodology we propose to leverage these additional cues in UDA settings. This behavior is also clearly observable from the qualitative results in Fig. 5. We point out that, to the best of our knowledge, the performance obtained by D4-ProDA + DBST, i.e. 58.8 mIoU (last row of Tab. 1), establishes the new state-of-the-art for GTA5→CS.

SYNSEQ→CS. In keeping with common practice in the literature, we also present results on the popular SYNTHIA dataset. As our pipeline requires video sequences to train the self-supervised monocular depth estimation network, we select the SYNTHIA VIDEO SEQUENCES split for training and the Cityscapes dataset for testing. We will call this setting SYNSEQ→CS. To address it, we re-trained the UDA methods for which the code is available and the training procedure is more affordable in terms of memory and run-time requirements, namely AdaptSegNet [49], MaxSquare [5] and MRNET [68]. The results in Tab. 2 show that all the selected UDA approaches exhibit a substantial performance gain when coupled with our proposal, with a general improvement in all classes. In particular, similarly to the results obtained in GTA5→CS, we observe a consistent improvement for classes related to objects with large and regular shapes (as depicted also in Fig. 5), with the only exception of a slight performance drop for the class building when using MRNET [68] (last row of Tab. 2). We argue that our approach is relatively less effective with MRNET [68] as, unlike AdaptSegNet [49] and MaxSquare [5], it already yields satisfactory results in those classes which are usually improved by the geometric cues injected by D4.

In the Suppl. Mat. we show that it is also possible to exploit the depth ground-truths provided by the SYNTHIA dataset as an additional source of supervision during the training of Monodepth2 [14], obtaining a small improvement in the performance of the overall framework.

4.4 Analysis

We report here the most relevant analyses concerning our work. Additional ones can be found in the Suppl. Mat.

Ablation studies. In Tab. 3, we analyze the impact on performance of our two main contributions, i.e. the injection of geometric cues into UDA methods by D4, and DBST. Purposely, we select the GTA5→CS benchmark and, for the top-performing UDA methods, we report the mIoU figures obtained by using the stand-alone UDA method (column UDA), combining it with D4 (column D4-UDA), applying DBST directly on the stand-alone method (column UDA + DBST) and embedding the method into our full pipeline (column D4-UDA + DBST). We can observe that each of our novel contributions improves the performance of the most recent UDA methods by a large margin, which is even more remarkable considering that the selected methods already include one or more steps of self-training. Moreover, D4 and DBST further enhance the performance of any selected method when deployed jointly, as shown in the column D4-UDA + DBST, suggesting that they are complementary. In order to further assess the effectiveness of DBST, in the column D4-UDA + ST we report results obtained by D4-UDA in combination with a baseline self-training procedure, which consists in simply fine-tuning the model on its own predictions on the images of the target domain. As the only difference between this procedure and our DBST is the dataset employed for fine-tuning, the results prove the effectiveness of DBST in generating a varied set of plausible samples more amenable to self-training than the original images belonging to the target domain.

Alternative strategies to exploit depth. As explained in Sec. 3.1 Semantics from Depth, we rely on the mechanism of transferring features across tasks and domains from [38] to inject depth cues into semantic segmentation. To validate our choice, we explore two possible alternatives, namely DeepLabV2-RGBD and DeepLabV2-Depth. Both consist of the popular DeepLabV2 [3] network, with RGBD images as input in the first case and depth maps (no RGB) in the second (more details in the Suppl. Mat.). Tab. 4 compares the performance of these alternatives with our method, either when used standalone (rows 2, 3, and 4) or when combined with LTIR [22] according to the strategy presented in Sec. 3.1 Combine with UDA. The results allow us to make some important considerations. First, our intuition on the possibility of exploiting depth to improve semantics is correct, since even these simple approaches improve over the baseline (reported in the first row of the table). Nonetheless, these naive methods produce a significantly smaller improvement than our approach, showing that our decision to adapt [38] to the UDA scenario is not obvious. Moreover, [38] requires only RGB images at test time. Finally, when combined with LTIR [22], a stronger depth-to-semantics model provides better results, validating our choice once again.

Method UDA D4-UDA UDA + DBST D4-UDA + DBST D4-UDA + ST
BDL [28] 48.5 49.6 51.7 52.9 50.1
MRNET [68] 48.3 49.6 50.0 51.7 50.3
Stuff and Things* [55] 48.3 49.1 50.4 51.4 49.4
FADA [54] 49.3 49.9 51.4 52.0 50.0
LTIR [22] 50.2 51.1 53.1 54.1 51.5
ProDA [64] 57.5 57.6 58.0 58.8 56.8
Table 3: Impact on performance of the two components of our proposal (D4, DBST) when applied separately or jointly to selected UDA methods on GTA5→CS. * indicates that the method was retrained by us. Results are reported in terms of mIoU.
Method mIoU
DeepLabV2-RGB 34.5
DeepLabV2-RGBD 35.5
DeepLabV2-Depth 36.5
Semantics from depth (Sec. 3.1) 43.1
DeepLabV2-RGBD merged with LTIR [22] 47.7
DeepLabV2-Depth merged with LTIR [22] 49.3
D4-LTIR 51.1
Table 4: Comparison between alternative methods to infer semantics from depth. DeepLabV2-RGB, DeepLabV2-RGBD and DeepLabV2-Depth stand for DeepLabV2 [3] trained on S using, respectively, RGB images, RGBD images or depth proxy-labels as input, while "Semantics from depth" is the approach described in Sec. 3.1 Semantics from Depth. "Merged with" denotes the merge operation described in Sec. 3.1 Combine with UDA. Results are reported in terms of mIoU on the Cityscapes dataset.
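
As a concrete illustration of the DeepLabV2-RGBD baseline in Tab. 4, one plausible way to feed RGBD inputs is to widen the first convolution of an ImageNet-pretrained backbone, as sketched below; this is an assumed implementation, not the authors' code.

```python
# Hedged sketch of an RGBD input head: extend the first convolution of an
# ImageNet-pretrained ResNet to accept a 4th (depth) channel.
import torch
import torch.nn as nn
from torchvision.models import resnet101

def rgbd_backbone():
    net = resnet101(weights="IMAGENET1K_V1")
    old = net.conv1                                   # Conv2d(3, 64, 7, stride=2, padding=3)
    new = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        new.weight[:, :3] = old.weight                # keep pretrained RGB filters
        new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)  # init depth channel
    net.conv1 = new
    return net
```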

Impact of video sequences. As described in Sec. 3.1, we obtain depth proxy-labels with a self-supervised depth estimation network [14], which we train using the raw video sequences (just RGB images) provided by the datasets involved in our experiments. In order to verify that merely accessing video sequences from the target domain does not provide an advantage by itself, we train AdaptSegNet [49] on GTA5→CS using the whole training split available for Cityscapes (i.e. 83300 images with temporal consistency). We choose AdaptSegNet [49] since it can be considered the building block of many UDA methods. We observe a drop in performance from 42.4 to 41.9 mIoU, showing that using video sequences does not boost semantic segmentation in a UDA setting, probably because of the similarity between consecutive frames, and that the improvement produced by our framework comes from the effective strategy that we adopt to exploit depth.

5 Conclusion

We have shown how to exploit self-supervised monocular depth estimation in UDA problems to obtain accurate semantic predictions for objects with strong geometric priors (such as roads and buildings). As all recent UDA approaches lack such geometric knowledge, we build our D4 method as a depth-based add-on, pluggable into any UDA method to boost performance. Finally, we employed self-supervised depth estimation to realize an effective data augmentation strategy for self-training. Our work highlights the possibility of exploiting auxiliary tasks learned by self-supervision to better tackle UDA for semantic segmentation, paving the way for novel research directions.

References

  • [1] M. Biasetton, U. Michieli, G. Agresti, and P. Zanuttigh (2019-06) Unsupervised domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §2.
  • [2] W. Chang, H. Wang, W. Peng, and W. Chiu (2019-06) All about structure: adapting structural information across domains for boosting semantic segmentation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781728132938, Link, Document Cited by: §2.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018-04) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. External Links: ISSN 2160-9292, Link, Document Cited by: §1, §4.1, §4.4, Table 4.
  • [4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §1.
  • [5] M. Chen, H. Xue, and D. Cai (2019-10) Domain adaptation for semantic segmentation with maximum squares loss. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038, Link, Document Cited by: Table 1, Table 2, Figure 5, §4.1, §4.3, §4.3.
  • [6] Y. Chen, W. Li, X. Chen, and L. V. Gool (2019) Learning semantic segmentation from synthetic data: a geometrically guided input-output adaptation approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1841–1850. Cited by: §1.
  • [7] Y. Chen, W. Li, X. Chen, and L. Van Gool (2019-06) Learning semantic segmentation from synthetic data: a geometrically guided input-output adaptation approach. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781728132938, Link, Document Cited by: §2.
  • [8] Y. Chen, W. Li, and L. V. Gool (2018-06) ROAD: reality oriented adaptation for semantic segmentation of urban scenes. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. External Links: ISBN 9781538664209, Link, Document Cited by: §1.
  • [9] J. Choi, T. Kim, and C. Kim (2019-10) Self-ensembling with gan-based data augmentation for domain adaptation in semantic segmentation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038, Link, Document Cited by: §2.
  • [10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016-06) The cityscapes dataset for semantic urban scene understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §4.2.
  • [11] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 248–255. External Links: Document Cited by: §1, §4.1.
  • [12] R. Garg, V. K. B.G., G. Carneiro, and I. Reid (2016) Unsupervised cnn for single view depth estimation: geometry to the rescue. Lecture Notes in Computer Science, pp. 740–756. External Links: ISBN 9783319464848, ISSN 1611-3349, Link, Document Cited by: §1.
  • [13] C. Godard, O. M. Aodha, and G. J. Brostow (2017-07) Unsupervised monocular depth estimation with left-right consistency. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781538604571, Link, Document Cited by: §1.
  • [14] C. Godard, O. M. Aodha, M. Firman, and G. Brostow (2019-10) Digging into self-supervised monocular depth estimation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038, Link, Document Cited by: §1, §1, §1, §3.1, §4.1, §4.2, §4.3, §4.4.
  • [15] V. Guizilini, R. Hou, J. Li, R. Ambrus, and A. Gaidon (2020) Semantically-guided representation learning for self-supervised monocular depth. In Proceedings of the Eighth International Conference on Learning Representations (ICLR), Cited by: §1.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 770–778. External Links: Document Cited by: §1.
  • [17] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In ICML, Cited by: §2.
  • [18] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) FCNs in the wild: pixel-level adversarial and constraint-based adaptation. External Links: 1612.02649 Cited by: §1.
  • [19] L. Hoyer, D. Dai, Y. Chen, A. Koring, S. Saha, and L. Van Gool (2021) Three ways to improve semantic segmentation with self-supervised depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11130–11140. Cited by: footnote 1.
  • [20] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. Cited by: §4.1.
  • [21] J. Jiao, Y. Cao, Y. Song, and R. Lau (2018-09) Look deeper into depth: monocular depth estimation with semantic booster and attention-driven loss. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1.
  • [22] M. Kim and H. Byun (2020-06) Learning texture invariant representation for domain adaptation of semantic segmentation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781728171685, Link, Document Cited by: Figure 1, Figure 2, §2, Table 1, Figure 5, §4.1, §4.2, §4.3, §4.4, Table 3, Table 4.
  • [23] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • [24] M. Klingner, J. Termöhlen, J. Mikolajczyk, and T. Fingscheidt (2020) Self-supervised monocular depth estimation: solving the dynamic object problem by semantic guidance. In European Conference on Computer Vision, pp. 582–600. Cited by: §1.
  • [25] J. N. Kundu, N. Lakkakula, and R. V. Babu (2019) UM-adapt: unsupervised multi-task adaptation using adversarial cross-task distillation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1436–1445. Cited by: §2.
  • [26] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Cited by: §2, §3.2.
  • [27] K. Lee, G. Ros, J. Li, and A. Gaidon (2019) Spigan: privileged adversarial learning from simulation. In International Conference on Learning Representations, Cited by: §1.
  • [28] Y. Li, L. Yuan, and N. Vasconcelos (2019-06) Bidirectional learning for domain adaptation of semantic segmentation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781728132938, Link, Document Cited by: §2, Table 1, Figure 5, §4.1, §4.2, §4.3, Table 3.
  • [29] Q. Lian, L. Duan, F. Lv, and B. Gong (2019-10) Constructing self-motivated pyramid curriculums for cross-domain semantic segmentation: a non-adversarial approach. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038, Link, Document Cited by: §1, §3.2.
  • [30] K. Mei, C. Zhu, J. Zou, and S. Zhang (2020-08) Instance adaptive self-training for unsupervised domain adaptation. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2, §3.2.
  • [31] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim (2018-06) Image to image translation for domain adaptation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. External Links: ISBN 9781538664209, Link, Document Cited by: §2.
  • [32] V. Olsson, W. Tranheden, J. Pinto, and L. Svensson (2020) ClassMix: segmentation-based data augmentation for semi-supervised learning. External Links: 2007.07936 Cited by: §3.2.
  • [33] F. Pan, I. Shin, F. Rameau, S. Lee, and I. S. Kweon (2020-06) Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781728171685, Link, Document Cited by: §1, §2, §3.2.
  • [34] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) ENet: A deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147. External Links: Link Cited by: §1, §3.1.
  • [35] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS 2017 Workshop on Autodiff, External Links: Link Cited by: §4.1.
  • [36] F. Pizzati, R. d. Charette, M. Zaccaria, and P. Cerri (2020-03) Domain bridge for unpaired image-to-image translation and unsupervised domain adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.
  • [37] P. Z. Ramirez, M. Poggi, F. Tosi, S. Mattoccia, and L. Di Stefano (2018) Geometry meets semantics for semi-supervised monocular depth estimation. In Asian Conference on Computer Vision, pp. 298–313. Cited by: §1.
  • [38] P. Z. Ramirez, A. Tonioni, S. Salti, and L. D. Stefano (2019-10) Learning across tasks and domains. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038, Link, Document Cited by: §1, §1, §1, §2, §3.1, §3.1, §4.1, §4.1, §4.4.
  • [39] S. R. Richter, Z. Hayder, and V. Koltun (2017) Playing for benchmarks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2232–2241. External Links: Link, Document Cited by: §4.2.
  • [40] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. Lecture Notes in Computer Science, pp. 102–118. External Links: ISBN 9783319464756, ISSN 1611-3349, Link, Document Cited by: §1, §1, §4.2.
  • [41] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241. External Links: ISBN 9783319245744, ISSN 1611-3349, Link, Document Cited by: §1.
  • [42] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez (2016-06) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §4.2.
  • [43] S. Saha, A. Obukhov, D. P. Paudel, M. Kanakis, Y. Chen, S. Georgoulis, and L. Van Gool (2021) Learning to relate depth and semantics for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8197–8207. Cited by: §2.
  • [44] E. Shelhamer, J. Long, and T. Darrell (2017-04) Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 640–651. External Links: ISSN 2160-9292, Link, Document Cited by: §1.
  • [45] L. N. Smith and N. Topin (2019-05) Super-convergence: very fast training of neural networks using large learning rates. Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications. External Links: ISBN 9781510626782, Link, Document Cited by: §4.1.
  • [46] M. Tan, R. Pang, and Q. V. Le (2020) Efficientdet: scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790. Cited by: §1.
  • [47] F. Tosi, F. Aleotti, M. Poggi, and S. Mattoccia (2019) Learning monocular depth estimation infusing traditional stereo knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9799–9809. Cited by: §1.
  • [48] W. Tranheden, V. Olsson, J. Pinto, and L. Svensson (2021-01) DACS: domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1379–1389. Cited by: §2, §3.2.
  • [49] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018-06) Learning to adapt structured output space for semantic segmentation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. External Links: ISBN 9781538664209, Link, Document Cited by: §2, Table 1, Table 2, §4.1, §4.2, §4.3, §4.3, §4.4.
  • [50] Y. Tsai, K. Sohn, S. Schulter, and M. Chandraker (2019-10) Domain adaptation for structured output via discriminative patch representations. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038, Link, Document Cited by: §2.
  • [51] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko (2015) Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4068–4076. Cited by: §2.
  • [52] T. Vu, H. Jain, M. Bucher, M. Cord, and P. P. Perez (2019-10) DADA: depth-aware domain adaptation in semantic segmentation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038, Link, Document Cited by: §1, §2.
  • [53] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Perez (2019-06) ADVENT: adversarial entropy minimization for domain adaptation in semantic segmentation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781728132938, Link, Document Cited by: §2.
  • [54] H. Wang, T. Shen, W. Zhang, L. Duan, and T. Mei (2020-08) Classes matter: a fine-grained adversarial approach to cross-domain semantic segmentation. In The European Conference on Computer Vision (ECCV), Cited by: §2, Table 1, §4.1, §4.3, Table 3.
  • [55] Z. Wang, M. Yu, Y. Wei, R. Feris, J. Xiong, W. Hwu, T. S. Huang, and H. Shi (2020-06) Differential treatment for stuff and things: a simple unsupervised domain adaptation method for semantic segmentation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781728171685, Link, Document Cited by: Figure 1, §2, §3.2, Table 1, §4.1, §4.3, Table 3.
  • [56] K. Watanabe, K. Saito, Y. Ushiku, and T. Harada (2018) Multichannel semantic segmentation with unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §1, §2.
  • [57] J. Watson, M. Firman, G. J. Brostow, and D. Turmukhambetov (2019) Self-supervised monocular depth hints. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2162–2171. Cited by: §1.
  • [58] Z. Wu, X. Han, Y. Lin, M. G. Uzunbas, T. Goldstein, S. N. Lim, and L. S. Davis (2018) DCAN: dual channel-wise alignment networks for unsupervised scene adaptation. Lecture Notes in Computer Science, pp. 535–552. External Links: ISBN 9783030012281, ISSN 1611-3349, Link, Document Cited by: §1, §2.
  • [59] J. Yang, R. Xu, R. Li, X. Qi, X. Shen, G. Li, and L. Lin (2020-04) An adversarial perturbation oriented domain adaptation approach for semantic segmentation. Proceedings of the AAAI Conference on Artificial Intelligence 34 (07), pp. 12613–12620. External Links: ISSN 2159-5399, Link, Document Cited by: §2.
  • [60] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in neural information processing systems, pp. 3320–3328. Cited by: §1.
  • [61] F. Yu, V. Koltun, and T. Funkhouser (2017-07) Dilated residual networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781538604571, Link, Document Cited by: §4.1.
  • [62] P. Zama Ramirez, A. Tonioni, and L. Di Stefano (2018) Exploiting semantics in adversarial training for image-level domain adaptation. In 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS), Vol. , pp. 49–54. External Links: Document Cited by: §2.
  • [63] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese (2018) Taskonomy: disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3712–3722. Cited by: §1.
  • [64] P. Zhang, B. Zhang, T. Zhang, D. Chen, Y. Wang, and F. Wen (2021) Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. arXiv preprint arXiv:2101.10979. Cited by: §2, Table 1, §4.3, Table 3.
  • [65] Y. Zhang and Z. Wang (2020) Joint adversarial learning for domain adaptation in semantic segmentation. In AAAI, Cited by: §2.
  • [66] Y. Zhang, P. David, and B. Gong (2017-10) Curriculum domain adaptation for semantic segmentation of urban scenes. 2017 IEEE International Conference on Computer Vision (ICCV). External Links: ISBN 9781538610329, Link, Document Cited by: §1.
  • [67] Y. Zhang, Z. Qiu, T. Yao, D. Liu, and T. Mei (2018-06) Fully convolutional adaptation networks for semantic segmentation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. External Links: ISBN 9781538664209, Link, Document Cited by: §2.
  • [68] Z. Zheng and Y. Yang (2020-07) Unsupervised scene adaptation with memory regularization in vivo. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. External Links: ISBN 9780999241165, Link, Document Cited by: §2, Table 1, Table 2, Figure 5, §4.1, §4.3, §4.3, Table 3.
  • [69] Z. Zheng and Y. Yang (2020) Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. International Journal of Computer Vision (IJCV). External Links: Document Cited by: §1, §2, §3.2.
  • [70] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017-07) Unsupervised learning of depth and ego-motion from video. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781538604571, Link, Document Cited by: §1.
  • [71] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017-10) Unpaired image-to-image translation using cycle-consistent adversarial networks. 2017 IEEE International Conference on Computer Vision (ICCV). External Links: ISBN 9781538610329, Link, Document Cited by: §2.
  • [72] Y. Zou, Z. Yu, B. V. Kumar, and J. Wang (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 289–305. Cited by: §1, §2, §3.2.
  • [73] Y. Zou, Z. Yu, X. Liu, B.V.K. V. Kumar, and J. Wang (2019-10) Confidence regularized self-training. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, §3.2.