Unsupervised Domain Adaptation for Video Semantic Segmentation

07/23/2021 ∙ by Inkyu Shin, et al. ∙ KAIST

Unsupervised Domain Adaptation for semantic segmentation has gained immense popularity because it can transfer knowledge from simulation to the real world (Sim2Real), largely cutting out the laborious per-pixel labeling effort on real data. In this work, we present a video extension of this task: Unsupervised Domain Adaptation for Video Semantic Segmentation. As large-scale video labels have become easy to obtain through simulation, we believe maximizing Sim2Real knowledge transferability is one of the promising directions for resolving the fundamental data-hungry issue in video. To tackle this new problem, we present a novel two-phase adaptation scheme. In the first phase, we exhaustively distill source domain knowledge using supervised loss functions; simultaneously, video adversarial training (VAT) is employed to align the features from source to target using video context. In the second phase, we apply video self-training (VST), focusing only on the target data. To construct robust pseudo labels, we exploit the temporal information in the video, which has rarely been explored in previous image-based self-training approaches. We set strong baseline scores on the 'VIPER to Cityscapes-VPS' adaptation scenario, and show that our proposals significantly outperform previous image-based UDA methods on both image-level (mIoU) and video-level (VPQ) evaluation metrics.







1 Introduction

Understanding each pixel in the video is one of the fundamental problems for autonomous driving [playfordata], video editing [kim2019deep], and high-level video understanding [jung2019discriminative]. Recently, this essential problem has been actively investigated in various forms: video semantic segmentation [nilsson2017semantic, gadde2017semantic, fayyaz2016stfcn, li2018lowlatency, liu2020efficient, shelhamer2016clockwork, zhu2017deep, xu2018dynamic], video instance segmentation [yang2019video, Feng_2019_ICCV, athar2020stemseg], and video panoptic segmentation [kim2020vps].

However, the advance in dense video understanding still lags behind that of image tasks. We see the main bottleneck as the lack of large-scale real-world video datasets. Creating video datasets at large scale is much more challenging than creating image datasets, in that human annotators must ensure consistent labels across frames. Furthermore, given the same budget, video datasets may be less diverse, since they require many highly redundant frame annotations. Therefore, simulators and game engines [Richter_2017] are emerging as attractive alternatives that can automatically generate accurate and consistent labels under various conditions. However, models trained on synthetic data cannot be directly deployed in the real world due to the fundamental domain shift [sankaranarayanan2018learning].

To alleviate the domain shift and seamlessly use knowledge learned from simulation, much progress has been made on Unsupervised Domain Adaptation (UDA). However, all previous work focuses on image-level adaptation; cross-domain video-level adaptation has rarely been explored. Image UDA methods [tsai2018learning, vu2019advent, wang2020differential, zou2018domain, zou2019confidence, li2019bidirectional, pan2020unsupervised, shin2020two, mei2020instance] align features image-wise, which is suboptimal for video as it overlooks the important temporal continuity of video. Here, we explore unsupervised domain adaptation for video semantic segmentation. The task aims to learn spatio-temporal knowledge from large-scale labeled simulation and transfer it to the unlabeled real domain. Consequently, we can produce spatially accurate and temporally consistent semantic segmentation predictions even with few or no labels in the real domain.

To tackle this new problem, we present a novel two-phase adaptation scheme, applying a video-specific adaptation technique in each phase: VAT and VST. In the first phase, we train video models using large-scale source data. As we can access full labels, we jointly use the standard cross-entropy loss and the newly presented tube matching loss. Meanwhile, we employ VAT, which aligns features from source to target using the space-time video context. In the second phase, we apply VST, focusing on learning from the target data. By exploiting the temporal information in the video, we obtain more accurate pseudo labels: specifically, we generate the pseudo labels from temporally aggregated predictions and refine them with a temporal consensus check.

We summarize our contributions as follows:

  1. We are the first to define and explore unsupervised domain adaptation for video semantic segmentation.

  2. We design a novel two-phase video domain adaptation scheme. The two main components are Video Adversarial Training (VAT) and Video Self Training (VST). We show that both are essential and complementary in constructing a compelling video adaptation method.

  3. We validate our proposal on the VIPER to Cityscapes-VPS scenario. Our method significantly outperforms all previous image-based UDA approaches on both image-specific (mIoU) and video-specific (VPQ [kim2020vps]) evaluation metrics. These results clearly show the importance of developing video-specific adaptation techniques.

2 Related Works

2.1 Image Domain Adaptation

Domain Adaptation (DA) is a classic problem of transferring knowledge from label-rich domains to label-scarce domains. It has proven especially effective for semantic segmentation, which requires laborious pixel-level labeling. By adaptation methodology, prior work can be largely divided into adversarial-learning-based DA [tsai2018learning, vu2019advent, wang2020differential] and self-training-based DA [zou2018domain, zou2019confidence]. Recently, the adaptation effect has been maximized by mixing these two methods [li2019bidirectional, pan2020unsupervised, shin2020two, mei2020instance]. Among them, IAST [mei2020instance] first reduces the two domains' distribution gap via adversarial learning, called the warm-up stage, and then proceeds with its self-training method.

However, it is not appropriate to simply apply the above methods to the video domain, where the temporal correlation between frames must be considered. Video-specific domain adaptation is expected to address this problem, but its development still lags behind video-specific recognition.

2.2 Video Semantic Segmentation

Among the video-specific recognition tasks (e.g., video semantic segmentation [nilsson2017semantic, gadde2017semantic, fayyaz2016stfcn, li2018lowlatency, liu2020efficient, shelhamer2016clockwork, zhu2017deep, xu2018dynamic], video instance segmentation [yang2019video, Feng_2019_ICCV, athar2020stemseg], and video panoptic segmentation [kim2020vps]), video semantic segmentation (VSS) has been in the spotlight the longest. It requires densely labeling every pixel of every frame in a video sequence with one of a few semantic categories. Previous work can be summarized into two paradigms depending on whether the focus is accuracy [nilsson2017semantic, gadde2017semantic, fayyaz2016stfcn, li2018lowlatency] or efficiency [shelhamer2016clockwork, zhu2017deep, xu2018dynamic]. These methods typically aim to capture the temporal relations between labeled and unlabeled frames within the same video clip, as real-world datasets [Cordts2016Cityscapes, kim2020vps] lack temporally dense labels. On the contrary, since simulator data [Richter_2017] can be labeled in every frame of a video clip thanks to labeling software built into the game engine, we need a new video-specific loss for ground-truth video tube labels. We therefore propose a source-specific tube matching loss and empirically confirm that it helps a model achieve high performance not only on the source domain but also on the target domain in the video domain adaptation framework.

2.3 Video Domain Adaptation

Unlike image-based DA, video-based DA is still an under-researched area. Nevertheless, due to its importance, quite a few video DA works [jamal2018deep, pan2020adversarial, chen2019temporal, choi2020shuffle] have recently emerged for the classification task. AMLS [jamal2018deep] utilizes C3D [tran2015learning] features on a Grassmann manifold, acquired using PCA, for action classification. TA3N [chen2019temporal], TCoN [pan2020adversarial], and SAVA [choi2020shuffle] achieve domain-adaptive video classification by learning temporal dynamics with attention mechanisms; SAVA [choi2020shuffle] goes further by using self-supervised clip-order prediction. However, the video domain adaptation for semantic segmentation that we propose is not a simple extension of video classification DA. Accordingly, we present a new framework of unsupervised domain adaptation for video semantic segmentation.

3 Preliminary

3.1 Problem Setting in image UDA

Following the common UDA setting in the image domain, we have full access to the data and labels, $\{X_S, Y_S\}$, in the labeled source domain. In contrast, in the unlabeled target domain, we can only utilize the data, $X_T$.

3.2 Image Adversarial Training (IAT)

Adversarial training in image DA is conducted by jointly minimizing a source-specific segmentation loss and a generative adversarial loss. Let $G$ be the semantic segmentation network, and let $P_s = G(I_s)$ and $P_t = G(I_t)$ be the predicted source and target segmentation maps, respectively. The source-specific segmentation loss is defined as $\mathcal{L}_{seg} = \mathcal{L}_{CE}(P_s, Y_s)$, where $\mathcal{L}_{CE}$ is the cross-entropy loss.

Besides, to align the distributions of source and target images, adversarial learning [tsai2018learning] is applied:

$$\mathcal{L}_{adv}(I_t) = -\textstyle\sum_{h,w} \log D(P_t)^{(h,w)} \quad (1)$$

A fully-convolutional image discriminator $D$ tries to discriminate the target segmentation map $P_t$ from the source one $P_s$, while the segmentation network $G$ attempts to fool the discriminator. The full objective function for IAT therefore combines the two terms as below (2):

$$\mathcal{L}_{IAT} = \mathcal{L}_{seg}(I_s) + \lambda_{adv}\,\mathcal{L}_{adv}(I_t) \quad (2)$$
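As a sanity check on the segmentation term above, here is a minimal numpy sketch of the mean per-pixel cross-entropy (an illustration only; the function name and array shapes are our own, not the paper's implementation):

```python
import numpy as np

def seg_cross_entropy(probs, labels, eps=1e-8):
    """Mean per-pixel cross-entropy between predicted class probabilities
    and a ground-truth label map.

    probs:  (K, H, W) softmax output of the segmentation network G.
    labels: (H, W) integer ground-truth map.
    """
    K, H, W = probs.shape
    rows = np.arange(H)[:, None]        # (H, 1) row indices
    cols = np.arange(W)[None, :]        # (1, W) column indices
    picked = probs[labels, rows, cols]  # probability assigned to the true class
    return float(-np.log(picked + eps).mean())
```

For a perfectly confident, correct prediction the loss approaches zero; for a uniform two-class prediction it equals log 2.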
3.3 Image Self Training (IST)

Self-training generates pseudo labels and retrains the model with them. Generally, the pseudo label for class $k$ can be formulated as

$$\hat{y}^{(k)} = \mathbb{1}\big[p^{(k)} > \theta^{(k)}\big] \quad (3)$$

where $\mathbb{1}[\cdot]$ is a function that returns the input if the condition is true and ignores it otherwise, and $\hat{y}^{(k)}$ and $p^{(k)}$ are the pseudo label and prediction for class $k$, respectively. When the confidence is above a certain threshold $\theta^{(k)}$, the predicted class is selected as the pseudo label.

How to define $\theta$ is crucial for selecting informative pseudo labels. In particular, IAST [mei2020instance] adaptively adjusts $\theta$ for each image to diversify the pseudo labels. The threshold for the current image $I_t$ is decided as follows:

$$\theta_t = \beta\,\theta_{t-1} + (1-\beta)\,\theta'_t \quad (4)$$

That is, $\theta_t$ is a moving average of the previous threshold $\theta_{t-1}$ and a representative threshold $\theta'_t$ for the current image, with momentum hyperparameter $\beta$. Basically, $\theta'_t$ is the top $\alpha \times 100\%$ confidence probability, while an additional decay term reduces the kept portion based on the difficulty of each class. Note that the thresholds of "hard" classes, which usually have noisy labels, are largely reduced.
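The class-adaptive thresholding rule can be sketched in a few lines of numpy (an illustration; `generate_pseudo_labels`, the array shapes, and the ignore index of 255 are assumptions for this sketch, not the authors' code):

```python
import numpy as np

def generate_pseudo_labels(probs, thresholds, ignore_index=255):
    """Keep a pixel's argmax class as its pseudo label only if the
    confidence exceeds that class's threshold; otherwise ignore the pixel.

    probs:      (K, H, W) softmax probabilities on a target image.
    thresholds: (K,) per-class confidence thresholds (theta).
    """
    labels = probs.argmax(axis=0)        # predicted class per pixel
    conf = probs.max(axis=0)             # confidence of that class
    keep = conf > thresholds[labels]     # class-wise threshold test
    return np.where(keep, labels, ignore_index)
```

A pixel predicted as a "hard" class with a lowered threshold is thus more likely to survive into the pseudo label map, which is exactly how the diversity of pseudo labels is controlled.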

Given pseudo labels on the target domain, the image self-training loss can be defined as below (5):


The third term represents regularization for self-training. Several regularizers have been proposed: confident-region KLD minimization [zou2019confidence], which prevents overfitting on the pseudo labels, and ignore-region entropy minimization [mei2020instance], which learns from non-pseudo-label regions.

4 Methods

Figure 2: Overview of the proposed unsupervised domain adaptation for video semantic segmentation (Video DA). Our Video DA consists of two phases of video-specific domain adaptation training: Video Adversarial Training (VAT) and Video Self Training (VST). In the first phase, VAT, we stack two neighboring outputs from the two domains (source and target) and use a sequence discriminator to align them. In the second phase, VST, we design an aggregation-based, clip-adaptive pseudo label generation strategy.

4.1 Problem Setting in Video UDA

Video domain adaptation aims to transfer space-time knowledge from a labeled source domain S to an unlabeled target domain T. Different from the single-frame setting in image UDA, the source set consists of temporally ordered frame sequences, $X_S = \{x_S^1, \dots, x_S^N\}$, and corresponding label sequences, $Y_S = \{y_S^1, \dots, y_S^N\}$. On the contrary, in the unlabeled target domain, we only have the unlabeled data; we denote the target frame sequences as $X_T = \{x_T^1, \dots, x_T^N\}$.

4.2 Per-frame Inference and Baselines

Motivated by [liu2020efficient], we adopt image semantic segmentation models and process each video frame independently at inference. In this way, we can start from a well-investigated image unsupervised domain adaptation framework. We build our framework on the state-of-the-art IAST [mei2020instance], which consists of two learning phases: adversarial warm-up training and instance-adaptive self-training.

Unlike static images, videos carry rich temporal and motion information. To train a model with such information, we extend each phase to video adversarial training and video self-training by designing a sequence discriminator and temporally corresponding pseudo labels, respectively. We detail each step in the following sections.

4.3 Video Adversarial Training (VAT)

Given input source frames $x_S^t$ and $x_S^{t+1}$ from times $t$ and $t+1$, the image semantic segmentation model predicts "soft" segmentation maps $P_S^t$ and $P_S^{t+1}$. Image-level semantic segmentation is learned with the conventional cross-entropy loss on each frame.

Tube Matching Loss Training a model on each frame independently may miss the rich temporal information that could be learned from temporally dense labels. We propose a novel tube matching loss that compares the tube prediction of each class with the ground-truth tube. Formally, the sequence of source predictions is expressed as $P_S = [P_S^t; P_S^{t+1}]$, the two predictions concatenated along the temporal dimension. We define a tube prediction $T_k$ as the stack of class-$k$ predictions on each frame; the ground-truth tube $\hat{T}_k$ is defined similarly. The tube matching loss is defined by the class-wise dice coefficient [milletari2016vnet] between the tube prediction $T_k$ and the ground-truth tube $\hat{T}_k$:

$$\mathcal{L}_{tube} = \frac{1}{K}\sum_{k=1}^{K}\big(1 - \mathrm{Dice}(T_k, \hat{T}_k)\big) \quad (6)$$

The dice coefficient is insensitive to the number of pixels predicted for each class and is defined as

$$\mathrm{Dice}(T_k, \hat{T}_k) = \frac{2\,|T_k \cap \hat{T}_k|}{|T_k| + |\hat{T}_k|} \quad (7)$$

The tube matching loss not only allows the model to learn space-time knowledge but also fully utilizes the temporally dense labels of the source domain.
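A soft version of the tube matching loss can be sketched as follows (a minimal numpy illustration under our own naming; the real implementation operates on network outputs, and the epsilon smoothing is an assumption of this sketch):

```python
import numpy as np

def dice(tube_pred, tube_gt, eps=1e-6):
    """Soft dice coefficient between a class-k prediction tube and its
    ground-truth tube, both shaped (T, H, W). The score is insensitive to
    how many pixels the class occupies."""
    inter = (tube_pred * tube_gt).sum()
    return (2.0 * inter + eps) / (tube_pred.sum() + tube_gt.sum() + eps)

def tube_matching_loss(pred, gt):
    """pred, gt: (K, T, H, W). Class-k predictions are stacked along the
    temporal axis into tubes; the loss is the mean of (1 - dice) over K."""
    K = pred.shape[0]
    return sum(1.0 - dice(pred[k], gt[k]) for k in range(K)) / K
```

A perfect prediction drives the loss to zero, while a complete class swap drives it toward one, regardless of how large or small each class region is.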

Sequence Discriminator To alleviate the domain discrepancy in both space and time, we propose a sequence discriminator. Different from an image discriminator [tsai2018learning, vu2019advent], the sequence discriminator takes a sequence of soft segmentation maps as input. The discriminator is trained to distinguish source sequences from target sequences; at the same time, the segmentation model attempts to fool the discriminator. The video GAN loss is as follows:

$$\mathcal{L}_{vadv}(x_T^t, x_T^{t+1}) = -\textstyle\sum_{h,w} \log D_{seq}(P_T)^{(h,w)} \quad (8)$$

where $P_T = [P_T^t; P_T^{t+1}]$ is the sequence of predictions on the target frames. By construction, the semantic segmentation model learns to align the sequence distributions of source and target.
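The generator-side video GAN term can be illustrated with the discriminator scores alone (a simplified numpy sketch: we assume the sequence discriminator outputs probabilities in (0, 1) for the "source" label, and we omit the discriminator's own training step):

```python
import numpy as np

def video_gan_loss(d_out_target, eps=1e-8):
    """Adversarial term seen by the segmentation network: the sequence
    discriminator's per-location scores on a *target* prediction sequence
    are pushed toward the 'source' label 1, so target sequences become
    indistinguishable from source ones.

    d_out_target: array of discriminator outputs in (0, 1)."""
    return float(-np.log(np.clip(d_out_target, eps, 1.0)).mean())
```

When the discriminator is fully fooled (all scores at 1) the loss vanishes; uncertain scores of 0.5 yield a loss of log 2 per location.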

Full objective function for VAT The total loss function is defined as

$$\mathcal{L}_{VAT} = \mathcal{L}_{CE} + \lambda_{tube}\,\mathcal{L}_{tube} + \lambda_{adv}\,\mathcal{L}_{vadv} \quad (9)$$

Along with the per-frame cross-entropy loss, the tube matching loss enforces accurate and consistent predictions on the source domain, while the video GAN loss adapts the learned representation to the target domain.

4.4 Video Self Training (VST)

4.4.1 Clip Adaptive Pseudo Label Generation with Temporally Aggregated Prediction

To select diverse pseudo labels, IAST [mei2020instance] adaptively adjusts the threshold of each class per instance, viewing each image as an instance. While retaining this strong point, we extend the pseudo-label generation strategy in a video-specific manner. Algorithm 1 details our pseudo label generation process.

Temporally Aggregated Prediction In the pseudo label generation step, we take neighboring frames to aggregate their predictions onto a target frame. Given the target frame $x_T^t$ and a neighbor frame $x_T^{t'}$, pixel-level correspondence is estimated by an optical flow network (e.g., FlowNet2.0 [ilg2016flownet]), and the predicted soft segmentation map of the neighbor frame is warped to the target frame. To aggregate diverse predictions from multiple frames, we average the predictions of the target and neighbor frames wherever the warped prediction is not occluded. The non-occluded map from $t'$ to $t$ is defined as $M_{t' \to t} = \mathbb{1}[e_{t' \to t} < \tau]$, where $\tau$ is a threshold value and $e_{t' \to t}$ denotes the occlusion score computed from the warping error between the target frame and the warped neighbor frame. The aggregated prediction at the target frame is computed as

$$\bar{P}_T^t = M_{t' \to t} \cdot \frac{P_T^t + \mathcal{W}_{t' \to t}(P_T^{t'})}{2} + (1 - M_{t' \to t}) \cdot P_T^t \quad (10)$$

Thus, the sequence of aggregated predictions for a clip is expressed as $\{\bar{P}_T^1, \dots, \bar{P}_T^N\}$, where N represents the total number of frames in the clip. We abbreviate the above process as Aggregate.
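The aggregation step can be sketched as follows, assuming the optical-flow warping and the warping-error map are computed elsewhere (all names and shapes here are illustrative, not the authors' code):

```python
import numpy as np

def aggregate_predictions(pred_t, warped_pred_n, warp_error, tau=0.7):
    """Fuse the target-frame prediction with a flow-warped neighbor
    prediction: average the two wherever the warping error marks the pixel
    as non-occluded, and fall back to the target prediction elsewhere.

    pred_t, warped_pred_n: (K, H, W) soft segmentation maps.
    warp_error: (H, W) photometric warping error.
    tau: occlusion threshold (0.7 in the paper)."""
    non_occ = (warp_error < tau).astype(pred_t.dtype)   # (H, W) mask
    fused = 0.5 * (pred_t + warped_pred_n)
    return non_occ * fused + (1.0 - non_occ) * pred_t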

Clip Adaptive Pseudo Label Generation As consecutive frames share a large amount of redundant information, we define each clip as a single instance. The threshold of the current clip $c$ for class $k$ is defined as follows:

$$\theta_c^{(k)} = m\,\theta_{c-1}^{(k)} + (1-m)\,\theta_c'^{(k)} \quad (11)$$

The same threshold is applied to every frame of the clip; it is an exponential moving average between the threshold of the previous clip and the representative threshold $\theta_c'^{(k)}$. The latter is acquired as in IAST, except that the operating unit is the sequence of predictions of a clip rather than the prediction of a single image.
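The clip-adaptive update can be sketched as below (a numpy illustration; the quantile-based representative threshold and the default values of p and m are our assumptions, simplified from IAST's class-difficulty weighting):

```python
import numpy as np

def update_clip_thresholds(theta_prev, clip_conf, clip_labels, num_classes,
                           p=0.5, m=0.9):
    """Per-class threshold for the current clip as an exponential moving
    average between the previous clip's threshold and a representative
    threshold computed over the WHOLE clip (the clip is one 'instance').

    theta_prev:  (K,) thresholds of the previous clip.
    clip_conf:   flattened max-probabilities over all frames of the clip.
    clip_labels: matching flattened argmax labels.
    p, m:        kept proportion and EMA momentum (illustrative values)."""
    theta = theta_prev.copy()
    for k in range(num_classes):
        conf_k = clip_conf[clip_labels == k]
        if conf_k.size:                          # class absent: keep old value
            rep = np.quantile(conf_k, 1.0 - p)   # top-p confidence cut
            theta[k] = m * theta_prev[k] + (1.0 - m) * rep
    return theta
```

Pooling confidences over the whole clip makes the threshold stable across the clip's frames, which is the point of treating the clip as a single instance.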

4.4.2 Video Self-training with Online Pseudo Label Refinement

Pseudo Label Refinement via Temporal Consensus The same pixel across a video should belong to the same class. In practice, however, this does not always hold for the pseudo labels. To eliminate such noise, we refine the pseudo labels with temporal correspondence. Specifically, the reference pseudo label is warped onto the current pseudo label, and we check the consensus between them; where they disagree, the model does not learn from that region. The refined current pseudo label can be formulated as follows:

Input: model G
Parameters: proportion p, momentum m
Output: target video pseudo-labels
1 init θ = 0.9
2 for c = 1 to N do
3       aggregate the clip's predictions  // * Eq. (10) * //
4       for each frame in clip c do
5             update θ and generate pseudo labels  // * Eq. (11) * //
6       end for
7 end for
8 return pseudo-labels
Algorithm 1: Pseudo label generation
Figure 3: Visualization of pseudo-labels w/o and w/ the online refinement. The proposed online refinement successfully eliminates noise in the pseudo labels by checking temporal consensus.
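The temporal consensus check described above can be sketched as follows (numpy illustration; the non-occluded mask is assumed to come from the same warping-error test used for aggregation, and the ignore index of 255 is our assumption):

```python
import numpy as np

def refine_pseudo_labels(pl_cur, warped_pl_ref, non_occluded, ignore_index=255):
    """Temporal consensus check: where a non-occluded pixel's current pseudo
    label disagrees with the flow-warped reference pseudo label, cut it out
    (set it to the ignore index) so the model never trains on it."""
    consensus = (pl_cur == warped_pl_ref) | ~non_occluded
    return np.where(consensus, pl_cur, ignore_index)
```

Occluded pixels are left untouched, since no reliable reference label exists for them.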

Full objective function for VST The overall loss function is expressed as

VIPER → Cityscapes-VPS [mIoU]. Columns: method, adaptation type (AT = adversarial training, ST = self-training, A+S = both), per-class IoU over the 15 common classes (including traffic light and traffic sign), and mIoU in the last column.
Source-only - 44.24 22.70 73.69 6.02 6.74 10.47 19.25 82.55 31.13 80.74 60.91 62.37 2.25 7.18 0.00 34.02
AdaptSegNet [tsai2018learning] AT 86.09 41.54 80.59 16.10 12.71 20.00 23.98 82.44 34.05 80.14 64.79 69.23 6.41 4.15 0.0 41.48
Advent [vu2019advent] AT 87.10 42.27 80.57 18.46 12.90 21.69 24.88 82.22 31.87 79.82 65.88 74.53 6.83 7.32 0.0 42.42
CBST [zou2018domain] ST 27.09 27.00 80.83 13.88 2.55 24.01 18.10 82.95 48.69 82.84 61.29 73.29 23.71 5.91 0.0 38.14
CRST [zou2019confidence] ST 25.52 23.71 81.15 18.85 3.02 25.57 20.39 83.01 46.66 80.91 63.06 74.65 22.76 12.38 0.52 38.81
CBST* [zou2018domain] A+S 88.70 56.66 82.88 32.16 17.59 32.30 19.38 83.80 40.74 74.57 69.68 69.00 17.77 12.32 00.83 46.56
CRST* [zou2019confidence] A+S 88.73 56.64 83.04 31.66 17.79 33.89 19.02 83.98 40.00 75.00 70.82 69.60 18.48 12.34 1.38 46.83
IAST [mei2020instance] A+S 91.35 62.29 85.55 35.75 19.62 36.51 27.26 86.10 38.98 84.36 70.89 73.03 13.33 14.43 0.90 49.36
Ours A+S 91.23 53.99 85.86 35.90 18.73 43.43 34.97 86.28 36.09 86.01 65.61 81.81 21.74 34.91 2.04 51.91
Oracle - 94.98 67.68 87.42 61.10 32.20 40.71 56.11 86.82 55.76 87.47 70.22 89.55 51.65 73.43 0.71 63.72
Table 1: Image semantic segmentation results (mIoU).
VIPER → Cityscapes-VPS [VPQ]. Columns: method, adaptation type, per-class VPQ over the 15 common classes (including traffic light and traffic sign), and VPQ in the last column.
Source-only - 28.55 3.84 65.95 0.45 0.0 0.1 5.37 75.22 5.50 64.29 18.38 48.50 1.34 0.0 0.0 21.17
AdaptSegNet [tsai2018learning] AT 84.97 18.90 75.87 4.66 0.05 2.98 8.02 74.76 6.69 65.26 20.85 59.02 2.19 0.0 0.0 28.28
Advent [vu2019advent] AT 85.55 20.02 74.59 6.90 0.27 3.31 8.04 74.70 6.66 65.64 22.80 65.84 2.03 0.14 0.0 29.10
CBST [zou2018domain] ST 5.56 8.60 74.88 7.05 0.05 3.14 5.27 75.53 12.31 67.83 21.71 61.30 4.72 0.0 0.0 23.20
CRST [zou2019confidence] ST 5.63 6.51 76.27 5.89 0.07 3.21 5.74 75.29 11.06 64.11 24.73 62.37 6.42 1.56 0.0 23.26
CBST* [zou2018domain] A+S 87.67 41.12 78.66 10.39 0.88 6.47 3.86 77.97 10.28 51.52 29.53 56.88 5.36 0.0 0.0 30.71
CRST* [zou2019confidence] A+S 87.74 41.80 78.81 9.90 1.03 8.16 3.63 78.18 10.87 52.07 30.02 57.50 5.58 0.0 0.35 31.04
IAST [mei2020instance] A+S 90.06 45.86 82.54 14.63 2.60 6.98 8.00 80.78 14.84 70.21 34.56 62.84 5.51 1.04 0.19 34.71
Ours A+S 90.64 37.32 82.68 11.31 2.46 10.76 14.40 81.20 13.97 75.42 32.68 74.89 9.30 9.37 0.0 36.43
Oracle - 94.81 49.98 84.24 23.16 4.37 7.10 28.58 80.40 17.73 73.84 32.84 78.63 10.57 35.00 0.0 41.42
Table 2: Video semantic segmentation results (VPQ).

5 Experiments

5.1 Experimental settings

Datasets We evaluate our video domain adaptation framework on a synthetic-to-real scenario: VIPER [Richter_2017] to Cityscapes-VPS [kim2020vps]. The VIPER dataset has 254,064 video frames with corresponding semantic labels; we take only the 42,455 images captured in "day" conditions as training data. Cityscapes-VPS is a video-level extension of the Cityscapes dataset [Cordts2016Cityscapes] that additionally annotates 5 frames out of each 30-frame video clip. Following the official split in [kim2020vps], we use 400 clips to train the model and evaluate on 50 validation sequences.

Baselines We compare our method with several baselines. (1) Source-only: a model naively trained on the source domain. (2) Adversarial-based image UDA methods: AdaptSegNet [tsai2018learning] and Advent [vu2019advent], which rely solely on GAN-based adversarial training. (3) Self-training-based image UDA methods: CBST [zou2018domain] and CRST [zou2019confidence], which iteratively generate pseudo labels and retrain the models. (4) Advanced methods such as IAST [mei2020instance], which combine adversarial training and self-training. In addition, we build strong baselines, CBST* and CRST*, by applying each method on top of the adversarially adapted model used in IAST.

Evaluation Metric The predicted results should be both as accurate as the ground truth and consistent in time. We first evaluate image-level semantic segmentation performance using the standard mean intersection over union (mIoU). In addition, we borrow the video panoptic quality (VPQ) [kim2020vps] to measure video-level accuracy. To fit the task of video semantic segmentation, we view the prediction of each class as a single instance (i.e., stuff) when the video panoptic quality is calculated, and refer to this modified VPQ metric throughout the paper. We measure mIoU and VPQ on 15 common classes.

Implementation details We use a pre-trained ResNet-50 [he2015deep] as the backbone network and adopt DeepLabv2 [chen2017deeplab] as the semantic segmentation head. Due to GPU memory limitations, we randomly select 2 consecutive frames during training. Starting from the image discriminator architecture in [tsai2018learning, vu2019advent], we replace the first 2D convolution layer with a 3D convolution layer of temporal stride 2; this modification allows the discriminator to take a sequence of predictions as input. We set the occlusion threshold $\tau$ to 0.7 for cutting out occluded areas.

Figure 4: Qualitative comparison with the baselines. Yellow and red boxes indicate inaccurate and temporally inconsistent predictions, respectively. Previous approaches suffer from both issues, while our framework successfully resolves them. Best viewed in color.

5.2 Comparison with State-of-the-art

In Table 1 and Table 2, we report quantitative adaptation performance on VIPER to Cityscapes-VPS. We compare the proposed method with the Source-only model and state-of-the-art image UDA models, including adversarial-based [tsai2018learning, vu2019advent], self-training-based [zou2018domain, zou2019confidence], and combined [mei2020instance] methods.

For image-level evaluation, the proposed method achieves the best mIoU of 51.91% by a healthy margin; for example, we improve on the state-of-the-art image-level UDA method, IAST, by +2.55% mIoU. This implies that per-frame semantic segmentation quality is improved by effectively leveraging spatio-temporal context at training time. Furthermore, as shown by the video metric and Fig. 4, our model produces predictions that are both more accurate and more consistent than those of the baselines. Note that the proposed method introduces no computational overhead at inference time.

5.3 Ablation Study on VAT

We show the contribution of each proposed component to VAT performance in Table 3.

Effectiveness of Tube Matching Loss.

We first investigate the impact of the tube matching loss. In model-(4), we add the tube matching loss on top of the image-level adaptation model (model-(2)). As the improved image and video metrics indicate, the tube matching loss helps the model learn spatio-temporal knowledge as an extra constraint. For a more comprehensive picture, we also report the result of model-(3), an image-level degradation of the tube matching loss (i.e., a per-frame dice loss). In terms of the video metric, model-(3) performs even worse than the simple image-level adaptation model. This implies that the improvements truly come from the tube-level design of the loss, not merely from using an image-level dice coefficient.

Effectiveness of Sequence Discriminator.

We also explore the efficacy of the sequence discriminator. Our final VAT model in Table 3 adopts the sequence discriminator instead of the image discriminator. We empirically confirm that the sequence discriminator enables better adaptation by additionally leveraging temporal information. As a result, our VAT model gains +1.5 mIoU and +1.0 VPQ over the prior adversarial method, which in turn provides better pseudo labels for the self-training phase.

Method | Dice (Single / Tube) | Adv. (Img. / Vid.) | mIoU | VPQ
(1) 34.02 21.17
(2) 43.95 29.78
(3) 44.03 29.51
(4) 44.75 30.27
Ours 45.53 30.81
Table 3: Ablation study on video adversarial training. We empirically verify the effectiveness of the proposed tube matching loss and sequence discriminator.
Method (ST / Agg. / Reg. / Ref.) | mIoU | VPQ
Video Adversarial Training 45.53 30.81
Class Balanced [zou2018domain] 47.63 32.26
Instance Adaptive [mei2020instance] 47.98 32.31
Clip Adaptive 49.18 33.48
+Aggregated prediction 49.52 34.02
+Regularization 51.52 35.78
+Temporal Refinement 51.90 36.43
Table 4: Ablation study on video self-training. "Agg.", "Reg.", and "Ref." denote temporally aggregated prediction, regularization, and online pseudo label refinement, respectively.
Adversarial Training | Self-Training | mIoU | VPQ
(2) Ours 48.58 33.95
(3) Ours 50.18 34.83
(4) Ours 50.70 35.66
Ours Ours 51.90 36.43
Table 5: Importance of video adversarial training. We run our full VST phase on the different adversarial models.
Figure 5: Visual comparisons of pseudo labels. We can clearly observe that the proposed method generates more accurate and consistent pseudo labels than the baselines.
Figure 6: Pseudo labels generated from different adversarial models. The quality of the generated pseudo labels depends on the pre-trained adversarial model: our method with the VAT model produces visibly better pseudo labels than with the IAT model.

5.4 Ablation Study on VST

Effectiveness of Proposed Pseudo Label Generation Strategy.

Here, we study the efficacy of clip adaptive pseudo label generation. We compare it with the previous state-of-the-art image-based pseudo label generation approaches: class-balanced [zou2018domain] and instance adaptive [mei2020instance]. As shown in Table 4, the model trained with the proposed clip adaptive pseudo labels performs best. Moreover, adopting the presented 'temporally aggregated prediction' further improves performance. This implies that exploiting the additional temporal information in video is crucial for adapting video models. Finally, we also provide qualitative results in Fig. 5: our clip adaptive pseudo label generation produces more accurate and temporally consistent labels. For example, both the "people" and "car" classes are overlooked by previous approaches but densified with our method.

Importance of video adversarial training.

Here we study how different first-phase adversarial methods affect the final performance, comparing them with the same video self-training in the second phase. The results are summarized in Table 5. Without our VAT proposals, the final performance drops by a significant margin, showing their efficacy. The best performance is achieved when all the proposals (i.e., tube matching and the video discriminator) are used together. This again shows the importance of exploiting video context when designing the adaptation technique. We also show qualitative pseudo labels obtained with IAT and VAT in Fig. 6: the video-level adversarial pre-training yields much denser and more accurate pseudo labels.

Visual analysis of online pseudo label refinement.

We illustrate the role of online pseudo label refinement in Fig. 7: it removes noisy pseudo labels by checking temporal consensus. For example, the original pseudo labels mislabel bicycle regions as person, and the refinement process successfully eliminates those parts. As a result, this process prevents the model from continually fitting to noise, yielding better performance (Table 6).

6 Conclusion

In this paper, we explore a new domain adaptation task: unsupervised domain adaptation for video semantic segmentation. We present a novel framework that consists of two video-specific domain adaptive training phases, VAT and VST. In the first step, we distill source knowledge jointly using the standard cross-entropy loss and the newly presented tube matching loss; meanwhile, VAT is applied for feature alignment. In the second step, video self-training is used to learn from the target data, where temporal information is leveraged to generate dense and accurate pseudo labels. We significantly outperform all previous strong image-based UDA baselines on the VIPER-to-Cityscapes-VPS scenario. We hope many follow-up studies will build on our proposals and results.


In this supplementary material, we provide:

  1. Details of a new UDA framework that supports multiple popular UDA approaches in a unified platform,

  2. The impact of online pseudo label refinement,

  3. Additional analyses of the pseudo labels,

  4. Limitations and discussions of the proposals,

  5. More qualitative adaptation results on Cityscapes-VPS.

Appendix A New UDA framework

One of the paper's main contributions is presenting a new UDA framework that supports multiple popular and contemporary UDA approaches for semantic segmentation. Specifically, our framework contains five strong UDA baselines in a unified platform, listed as follows:

A.1 Adversarial-based

  • AdaptSegNet [tsai2018learning]: adapting structured output semantic segmentation logits, proposed in 2018.

  • Advent [vu2019advent]: adapting entropy maps, proposed in 2019.

A.2 Self-training-based

  • CBST [zou2018domain]: class-balanced pseudo label generation, proposed in 2018.

  • CRST [zou2019confidence]: confidence regularized self-training, proposed in 2019.

  • IAST [mei2020instance]: instance adaptive pseudo label generation, proposed in 2020.

We believe this framework is by far the first complete UDA toolbox. On top of this framework, we developed our new proposals, video adversarial training (VAT) and video self-training (VST), enabling fair apples-to-apples comparisons. The code and models will be released.

Appendix B The impact of online pseudo label refinement

In the main paper, we exploit the temporal information in pseudo label generation so that temporally inconsistent labels are cut out. Another reasonable baseline is to directly borrow the reference information to fill in the missing labels in the current frame. In practice, we propagate the pseudo labels of the reference frame ($\hat{Y}_{ref}$) to those of the current frame ($\hat{Y}_{cur}$) using a flow-based warping function ($\mathcal{W}$). The overall procedure can be formulated as

$\hat{Y}_{cur} \leftarrow \hat{Y}_{cur} \cup \big( \mathcal{W}(\hat{Y}_{ref}) * M \big),$

where $\hat{Y}_{cur}$ is the pseudo label region of the current image, $M$ is the non-occluded map from the reference frame to the current frame, and $*$ represents element-wise multiplication.
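A minimal sketch of this fill-in propagation is given below. It is our own simplified NumPy version: it assumes nearest-neighbor warping along a backward flow field and uses -1 to mark unlabeled pixels, whereas the actual pipeline uses FlowNet flows; treat it as an illustration, not the released code.

```python
import numpy as np

def warp_labels(ref_labels, flow):
    """Warp reference-frame labels to the current frame (nearest neighbor).
    ref_labels: (H, W) int array, -1 = no pseudo label.
    flow: (H, W, 2) backward flow from current to reference frame (dx, dy)."""
    h, w = ref_labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return ref_labels[src_y, src_x]

def fill_in(cur_labels, ref_labels, flow, non_occluded):
    """Fill unlabeled pixels of the current frame with warped reference
    labels, restricted to non-occluded pixels (the fill-in baseline)."""
    warped = warp_labels(ref_labels, flow)
    out = cur_labels.copy()
    missing = (cur_labels == -1) & non_occluded & (warped != -1)
    out[missing] = warped[missing]
    return out
```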

The comparison results are summarized in Table 6. We observe that the fill-in approach is inferior to the baseline, whereas the proposed cut-out algorithm improves over it. We attribute this to the inherent noise in pseudo labels: cut-out-based regularization is safer than fill-in-based label accumulation. To back our claim, we also provide qualitative results in Fig. 7, which show the general tendency that our cut-out based label refinement accumulates fewer erroneous labels than the fill-in based approach.
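The cut-out refinement takes the opposite stance: instead of importing labels, it discards current-frame pseudo labels that disagree with the warped reference labels. A simplified NumPy sketch of this temporal consensus check (our own illustration, assuming warped reference labels and a non-occlusion mask are already available, with -1 marking ignored pixels):

```python
import numpy as np

def cut_out(cur_labels, warped_ref_labels, non_occluded):
    """Drop current-frame pseudo labels that fail the temporal consensus
    check against the warped reference labels; -1 = ignored pixel."""
    out = cur_labels.copy()
    inconsistent = (non_occluded
                    & (cur_labels != -1)
                    & (warped_ref_labels != -1)
                    & (cur_labels != warped_ref_labels))
    out[inconsistent] = -1
    return out
```

Because a disagreeing label is removed rather than overwritten, flow errors can only shrink the training signal instead of injecting wrong labels, which matches the behavior reported in Table 6.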

Method                 mIoU   VPQ
Base                   51.52  35.78
Base+fill-in           50.0   34.78
Base+cut-out (ours)    51.91  36.43
Table 6: Comparative performance of online refinement methods, applied on top of the proposed VAT and VST.
Figure 7: Visualization of pseudo labels without and with the online refinements. The proposed cut-out refinement successfully eliminates noise in the pseudo labels by checking temporal consensus, whereas the fill-in based method introduces additional noise into the pseudo labels.
Figure 8: Visualization of cause and effect in our failure case. Our model is comparatively weak on certain classes, which may originate from the pseudo label generation process. A detailed analysis is given in Appendix D.

Appendix C Additional analysis of Pseudo Labels

In the main paper, we already highlight the effectiveness of the proposed pseudo label generation strategy with the quantitative and qualitative comparisons in Table 4 and Figure 5. In addition, we measure the mIoU of the different pseudo labels, which we denote P-mIoU. For fair comparisons, we tune the hyperparameter of each method (see Preliminary in the main paper) so that all methods produce the same proportion of pseudo labels. The results are in Table 7. We again observe that our proposal generates the most accurate pseudo labels regardless of the proportion, which implies that leveraging the additional temporal information is essential.

Proportion   CB [zou2018domain]   IA [mei2020instance]   Ours
0.3          67.8                 67.9                   68.8
0.4          64.8                 66.4                   67.1
Table 7: P-mIoU of different pseudo labels. CB and IA denote the class balanced [zou2018domain] and instance adaptive [mei2020instance] pseudo label generation methods, respectively.
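The P-mIoU metric itself is straightforward: it is the standard mIoU of the pseudo labels against the ground truth, restricted to pixels that actually received a pseudo label. A minimal NumPy sketch (our own, under the assumption that unlabeled pseudo-label pixels are excluded from both intersection and union):

```python
import numpy as np

def p_miou(pseudo, gt, num_classes, ignore=-1):
    """mIoU of pseudo labels vs. ground truth over labeled pixels only.
    pseudo, gt: (H, W) int arrays; `ignore` marks unlabeled pseudo pixels."""
    valid = pseudo != ignore
    ious = []
    for c in range(num_classes):
        p = (pseudo == c) & valid
        g = (gt == c) & valid
        union = (p | g).sum()
        if union == 0:
            continue  # class absent from the evaluated region
        ious.append((p & g).sum() / union)
    return float(np.mean(ious))
```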

Appendix D Limitations and Discussions

In order to make the proposed video UDA framework more competitive and to facilitate future research, we point out a few specific items that call for continuing efforts:

  • FlowNet dependency. Currently, pseudo label generation in our framework relies heavily on FlowNet [ilg2016flownet], which is trained on external simulated data. This inevitably leads to inferior adaptation results on certain classes that FlowNet cannot capture well. For example, in Fig. 8, the adaptation results for the “Sidewalk” class are inferior to those of the IAST framework [mei2020instance] (see Fig. 8 (d)-(f)), which originates from the imperfect pseudo labels (see Fig. 8 (a)-(c)). One possible strategy might be learning to adapt without FlowNet. We expect deeper explorations in this direction in the future.

  • How to select the initial model for VST. As pointed out in Figure 6 of the main paper, the quality of the pseudo labels largely depends on the initial model. Even with the same adversarial method, it is hard to select the best iteration without target domain labels because adversarial training is unstable; it remains unclear what criterion should be used to select the model for the best final performance. While this issue is an inherent limitation of the “adversarial-then-self-training” framework itself, addressing it is also important for better video DA methods.

Appendix E More qualitative results

We release video results (https://youtu.be/z-rBcY87XCw) on the test set of the Cityscapes dataset [Cordts2016Cityscapes]. Our method produces much clearer and more temporally consistent predictions than the state-of-the-art image UDA method, confirming its robustness and effectiveness.