Video semantic segmentation is an essential task in the analysis and understanding of videos. Recent efforts largely focus on supervised video segmentation that learns from fully annotated data, but the learned models often suffer a clear performance drop when applied to videos of a different domain. This paper presents DA-VSN, a domain adaptive video segmentation network that addresses domain gaps in videos via temporal consistency regularization (TCR) over consecutive frames of target-domain videos. DA-VSN consists of two novel and complementary designs. The first is cross-domain TCR, which guides the predictions of target frames to have temporal consistency similar to that of source frames (learned from annotated source data) via adversarial learning. The second is intra-domain TCR, which guides unconfident predictions of target frames to have temporal consistency similar to that of confident predictions of target frames. Extensive experiments demonstrate the superiority of the proposed domain adaptive video segmentation network, which outperforms multiple baselines consistently by large margins.
Video semantic segmentation aims to assign pixel-wise semantic labels to video frames, and it has been attracting increasing attention as an essential task in video analysis and understanding [19, 45, 15, 37, 52]. With the advance of deep neural networks (DNNs), several studies in recent years have achieved very impressive video segmentation performance [57, 33, 20, 27, 30, 31, 25, 39]. However, most existing works require large amounts of densely annotated training videos, which entail a prohibitively expensive and time-consuming annotation process [3, 14]. One way to alleviate the annotation constraint is to resort to self-annotated synthetic videos collected from computer-generated virtual scenes [54, 23], but models trained on the synthesized data often suffer clear performance drops when applied to videos of natural scenes, largely due to the domain shift illustrated in Fig. 1.
Domain adaptive video segmentation is largely neglected in the literature despite its great value in both research and practical applications. It could be addressed by leveraging two lines of existing research. The first is domain adaptive image segmentation [71, 62, 70, 50, 65], which could treat each video frame independently. However, it does not consider the temporal information that is crucial in video semantic segmentation. The second is semi-supervised video segmentation [48, 69, 5], which exploits sparsely annotated frames to segment the unannotated frames of the same video. However, semi-supervised video segmentation was designed for consecutive frames of the same domain and does not work well for domain adaptive video segmentation, which usually involves clear domain shifts and non-consecutive video frames from different sources.
In this work, we design a domain adaptive video segmentation network (DA-VSN) that introduces temporal consistency regularization (TCR) to bridge the gaps between videos of different domains. The design is based on the observation that a video segmentation model trained in a source domain tends to produce temporally consistent predictions over source-domain data but temporally inconsistent predictions over target-domain data (due to domain shifts), as illustrated in Fig. 1. We design two complementary regularization modules in DA-VSN, namely cross-domain TCR (C-TCR) and intra-domain TCR (I-TCR). C-TCR employs adversarial learning to minimize the discrepancy of temporal consistency between the source and target domains. Specifically, it guides target-domain predictions to have temporal consistency similar to that of source-domain predictions, which usually have decent quality thanks to learning from fully annotated source-domain data. I-TCR instead works from a different perspective, guiding unconfident target-domain predictions to have temporal consistency similar to that of confident target-domain predictions. In I-TCR, we leverage entropy to measure prediction confidence, which works effectively across multiple datasets.
The contributions of this work can be summarized in three major aspects. First, we propose a new framework that introduces temporal consistency regularization (TCR) to address domain shifts in domain adaptive video segmentation. To the best of our knowledge, this is the first work that tackles unsupervised domain adaptation in video semantic segmentation. Second, we design cross-domain TCR and intra-domain TCR, which greatly improve domain adaptive video segmentation by minimizing the discrepancy of temporal consistency across domains and across target-domain video frames, respectively. Third, extensive experiments over two challenging synthetic-to-real benchmarks (VIPER → Cityscapes-Seq and SYNTHIA-Seq → Cityscapes-Seq) show that the proposed DA-VSN achieves superior domain adaptive video segmentation as compared with multiple baselines.
Video semantic segmentation aims to predict pixel-level semantics for each video frame. Most existing works exploit inter-frame temporal relations for robust and accurate segmentation [28, 68, 20, 38, 64, 34, 30, 25, 39]. For example, [68, 20] employ optical flow to warp feature maps between frames. Other works leverage inter-frame feature propagation for efficient video segmentation with low latency, adopt an adaptive fusion policy for effective integration of predictions from different frames, distribute several sub-networks over sequential frames and recompose the extracted features via attention propagation, or distill temporal-consistency knowledge into a compact network for per-frame inference.
In addition, semi-supervised video segmentation has been investigated, which exploits sparsely annotated video frames to segment the unannotated frames of the same videos. Two typical approaches have been studied. The first is based on label propagation, which warps labels of sparsely annotated frames to generate pseudo labels for unannotated frames via patch matching [1, 4], motion cues [58, 69] or optical flow [47, 68, 48, 17]. The other is based on self-training, which generates pseudo labels through distillation across multiple augmentations.
Both supervised and semi-supervised video segmentation work on frames of the same video or the same domain, where domain gaps are negligible. In addition, both require a certain amount of pixel-level annotated video frames that are prohibitively expensive and time-consuming to collect. Our proposed domain adaptive video segmentation instead exploits off-the-shelf video annotations from a source domain for segmenting videos of a different target domain, without requiring any annotations of target-domain videos.
Domain adaptive video classification has been explored to investigate domain discrepancy in the action classification problem. One category of works focuses on the action recognition task, which classifies a video clip into a particular category of human actions via temporal alignment, temporal attention [49, 13], or self-supervised video representation learning [46, 13]. Another category focuses on action segmentation, which simultaneously segments a video in time and classifies each segmented clip with an action class via temporal alignment or self-supervised video representation learning.
This work focuses on domain adaptive semantic segmentation of videos, a new and much more challenging domain adaptation task compared with domain adaptive video classification. Note that existing domain adaptive video classification methods do not work for the semantic segmentation task, as they cannot generate pixel-level dense predictions for each video frame.
Domain adaptive image segmentation has been widely investigated to address the image annotation challenge and domain shift issues [7, 65]. Most existing methods take one of two typical approaches, namely adversarial learning [24, 59, 62, 60, 40, 26, 50, 21] or self-training [71, 70, 35, 36, 65, 32, 67, 44]. Adversarial-learning-based methods perform domain alignment by adopting a discriminator that strives to differentiate the segmentation in the space of inputs [24, 66, 35, 12, 32], features [61, 24, 11, 66, 40] or outputs [59, 62, 41, 60, 26, 42, 63, 50, 21]. Self-training-based methods predict pseudo labels for target-domain data and then exploit the predicted pseudo labels to fine-tune the segmentation model iteratively.
Though a number of domain adaptive image segmentation techniques have been reported in recent years, they do not consider temporal information which is critically important in video segmentation. We introduce temporal consistency of videos as a constraint and exploit it to regularize the learning in domain adaptive video segmentation.
Given source-domain video frames with corresponding labels and target-domain video frames without labels, the goal of domain adaptive video segmentation is to learn a model that produces accurate predictions in the target domain. According to domain adaptation theory, the target error is bounded by three terms: a shared error of the ideal joint hypothesis on the source and target domains, an empirical source-domain error, and a divergence measure between the source and target domains.
This work focuses on the third term and presents a domain adaptive video semantic segmentation network (DA-VSN) that minimizes the divergence between the source and target domains. We design a novel temporal consistency regularization (TCR) technique for consecutive frames in the target domain, which consists of two complementary components, a cross-domain TCR (C-TCR) component and an intra-domain TCR (I-TCR) component, as illustrated in Fig. 2. C-TCR targets cross-domain alignment by encouraging target predictions to have temporal consistency similar to that of source predictions (which are accurate thanks to supervised learning), while I-TCR targets intra-domain adaptation by forcing unconfident predictions to have temporal consistency similar to that of confident predictions in the target domain. More details are described in the ensuing two subsections.
Note that the shared error in the first term (the difference in labeling functions across domains) is usually small, as shown in prior work. The empirical source-domain error in the second term comes from supervised learning in the source domain; for domain adaptive video segmentation, we directly adopt a video semantic segmentation loss [68, 30, 25, 39] as the source-domain supervised learning loss.
Cross-domain temporal consistency regularization (C-TCR) aims to guide target predictions to have temporal consistency similar to that of source predictions, which are learned by minimizing the supervised source loss and usually have decent quality. We design a dual-discriminator structure for spatial-temporal alignment of source and target video clips, as illustrated in Fig. 3. One discriminator focuses on spatial alignment of single video frames across domains (as in domain adaptive image segmentation), while the other focuses on temporal alignment of consecutive video frames across domains. Since the spatial-temporal discriminator inevitably involves spatial information, we introduce a divergence loss between the two discriminators to force the spatial-temporal discriminator to focus on alignment in the temporal space.
For spatial alignment, the spatial discriminator aligns the frame-level predictions of the source and target domains with a standard adversarial objective.
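As a concrete illustration, the spatial adversarial objective can be sketched as a binary cross-entropy GAN loss in output space. This is a minimal sketch under stated assumptions: `d_spatial` is a hypothetical discriminator module, and the BCE loss form is a common choice rather than the paper's verbatim formulation.

```python
import torch
import torch.nn.functional as F

def spatial_adv_losses(d_spatial, pred_src, pred_tgt):
    """Output-space adversarial objective (BCE form, a common choice).

    d_spatial: discriminator mapping a softmax prediction map to
               real/fake logits (hypothetical module).
    pred_src, pred_tgt: segmentation logits of shape (N, C, H, W).
    """
    src_score = d_spatial(F.softmax(pred_src, dim=1))
    tgt_score = d_spatial(F.softmax(pred_tgt, dim=1))
    real = torch.ones_like(src_score)
    fake = torch.zeros_like(tgt_score)
    # Discriminator: classify source predictions as real, target as fake.
    d_loss = 0.5 * (F.binary_cross_entropy_with_logits(src_score, real)
                    + F.binary_cross_entropy_with_logits(tgt_score, fake))
    # Segmentation network: fool the discriminator on target predictions.
    g_loss = F.binary_cross_entropy_with_logits(tgt_score, real)
    return d_loss, g_loss
```

In practice `d_loss` updates the discriminator while `g_loss` back-propagates into the segmentation network on target frames only.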
For temporal alignment, we forward the current frame and its consecutive frame through the segmentation model to obtain the current prediction and its consecutive prediction, and stack the two consecutive predictions to encode the spatial-temporal information of the source domain. The same process is applied to the target domain, producing two stacked consecutive target predictions that encode the spatial-temporal information of the target domain. The spatial-temporal discriminator then aligns the stacked source and target predictions with an adversarial objective.
We enforce divergence between the weights of the spatial and spatial-temporal discriminators so that the spatial-temporal discriminator can focus more on temporal alignment. The weight divergence of the two discriminators is enlarged by minimizing their cosine similarity, summed over the convolutional layers of the discriminators, where the weights of each pair of corresponding convolutional layers are flattened before computing the similarity.
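The weight-divergence loss can be sketched as follows. This is an illustrative implementation that assumes the two discriminators share the same convolutional layout; layers whose weight shapes differ (e.g. due to different input channels) are skipped.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weight_divergence_loss(d_spatial, d_spatiotemporal):
    """Average |cosine similarity| between corresponding flattened conv
    weights of the two discriminators. Minimizing this pushes the weights
    apart so the spatial-temporal discriminator specializes in temporal cues.
    """
    convs_a = [m for m in d_spatial.modules() if isinstance(m, nn.Conv2d)]
    convs_b = [m for m in d_spatiotemporal.modules() if isinstance(m, nn.Conv2d)]
    total, count = 0.0, 0
    for ca, cb in zip(convs_a, convs_b):
        wa, wb = ca.weight.flatten(), cb.weight.flatten()
        if wa.numel() != wb.numel():  # mismatched layer shapes: skip
            continue
        total = total + F.cosine_similarity(wa, wb, dim=0).abs()
        count += 1
    return total / max(count, 1)
```

Two identical discriminators yield a loss of 1 per compared layer, while orthogonal weights drive the loss toward 0.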
The intra-domain temporal consistency regularization (I-TCR) aims to minimize the divergence between the source and target domains by suppressing temporal inconsistency across target frames. As illustrated in Fig. 4, I-TCR guides unconfident target predictions to have temporal consistency similar to that of confident target predictions. Specifically, it first propagates the predictions of previous frames forward using frame-to-frame optical flow estimates, and then forces unconfident predictions in the current frame to be consistent with confident predictions propagated from the previous frame.
In the target domain, we forward consecutive frames through the segmentation model to obtain the current prediction and the previous prediction. We then adopt FlowNet to estimate the optical flow between the previous and current frames, and warp the previous prediction with the estimated flow to generate the propagated prediction.
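The flow-based propagation step can be sketched with `grid_sample`. Here `flow` is assumed to be a backward flow (for each current-frame pixel, its sub-pixel location in the previous frame), a convention assumption on our part; the paper uses FlowNet estimates, but any dense pixel-unit flow field works in this sketch.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(prev_pred, flow):
    """Warp the previous frame's prediction to the current frame.

    prev_pred: (N, C, H, W) probability map of the previous frame.
    flow:      (N, 2, H, W) backward flow in pixels; channel 0 is the
               x-displacement, channel 1 the y-displacement.
    """
    n, _, h, w = prev_pred.shape
    # Base pixel-coordinate grid (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(prev_pred.device)
    coords = base.unsqueeze(0) + flow  # where each current pixel came from
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(prev_pred, grid, align_corners=True)
```

With a zero flow field the warp reduces to the identity, which is a convenient sanity check.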
To force unconfident predictions in the current frame to be consistent with confident predictions propagated from the previous frame, we employ an entropy function to estimate prediction confidence and use confident predictions (i.e., with low entropy) to optimize unconfident predictions (i.e., with high entropy). The I-TCR loss penalizes the discrepancy between the current prediction and the propagated prediction, gated by a signum-based indicator that returns 1 where the propagated prediction has lower entropy than the current one and 0 otherwise.
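A minimal sketch of the I-TCR loss under stated assumptions: pixel-wise entropy measures (inverse) confidence, a signum-style gate selects pixels where the propagated prediction is more confident, and an L1 discrepancy (our choice for illustration; the paper's exact discrepancy measure may differ) is penalized at those pixels.

```python
import torch

def itcr_loss(cur_pred, prop_pred, eps=1e-8):
    """cur_pred, prop_pred: (N, C, H, W) softmax probability maps for the
    current frame and the prediction propagated from the previous frame."""
    # Pixel-wise entropy as an (inverse) confidence measure.
    ent_cur = -(cur_pred * (cur_pred + eps).log()).sum(dim=1)
    ent_prop = -(prop_pred * (prop_pred + eps).log()).sum(dim=1)
    # Signum-style gate: 1 where the propagated prediction is more
    # confident (lower entropy) than the current one, 0 otherwise.
    gate = (ent_cur - ent_prop > 0).float()
    # Penalize the discrepancy only at unconfident current-frame pixels;
    # the propagated prediction is detached so it acts as a fixed target.
    diff = (cur_pred - prop_pred.detach()).abs().sum(dim=1)
    return (gate * diff).sum() / (gate.sum() + eps)
```

Pixels where the current prediction is already the more confident one contribute nothing, so the regularizer only pulls unconfident predictions toward confident propagated ones.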
DA-VSN jointly optimizes the source-domain supervised learning (i.e., SSL) and the target-domain unsupervised learning (i.e., C-TCR and I-TCR), with balancing weights that control the trade-off between the supervised and unsupervised learning in the source and target domains.
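The joint objective can be written as a simple weighted sum. How the three reported weights (1, 1 and 0.001, see the implementation details) map onto the individual target-domain terms is our assumption for illustration.

```python
def davsn_objective(l_ssl, l_sa, l_sta, l_itcr, lambdas=(1.0, 1.0, 0.001)):
    """Supervised source loss plus weighted target-domain regularizers
    (spatial adversarial, spatial-temporal adversarial, I-TCR)."""
    l1, l2, l3 = lambdas
    return l_ssl + l1 * l_sa + l2 * l_sta + l3 * l_itcr
```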
Datasets: Our experiments involve two challenging synthetic-to-real domain adaptive video semantic segmentation tasks: VIPER → Cityscapes-Seq and SYNTHIA-Seq → Cityscapes-Seq. Cityscapes-Seq is a standard benchmark for supervised video semantic segmentation and serves as the target-domain dataset. It provides video sequences for training and evaluation, where each sequence consists of consecutive real-world frames with a ground-truth label provided for one frame per sequence. VIPER is used as one source-domain dataset; it contains synthesized video frames with segmentation labels produced by a game engine. SYNTHIA-Seq is used as the other source-domain dataset; it contains synthesized video frames with automatically generated segmentation annotations. The frame resolutions differ across Cityscapes-Seq, VIPER and SYNTHIA-Seq.
Implementation Details: We adopt ACCEL as the video semantic segmentation architecture. It consists of two segmentation branches, an optical flow network and a score fusion layer. Each segmentation branch generates a single-frame prediction using a Deeplab network whose backbone is a ResNet-101 pre-trained on ImageNet. The optical flow network propagates the prediction of the previous frame via FlowNet, and the score fusion layer adaptively integrates the predictions of the previous and current frames using a convolutional layer. All the discriminators in our experiments follow the design in DCGAN. For the efficiency of training and inference, we apply bicubic interpolation to downsample every video frame in Cityscapes-Seq and VIPER. Our experiments are built on PyTorch and the memory usage is below 12 GB. All the models are trained using the SGD optimizer with momentum and weight decay, and the learning rate follows a polynomial decay schedule. The three balancing weights are set to 1, 1 and 0.001, respectively. The mean intersection-over-union (mIoU), the standard evaluation metric in semantic segmentation, is adopted to evaluate all methods.
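For reference, mIoU can be computed from a per-class confusion matrix as below. This is the standard formulation, not code from the paper.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Accumulate a confusion matrix (rows: ground truth, cols: prediction)
    from flat integer label arrays; labels outside [0, num_classes) are ignored."""
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf_mat):
    """Mean intersection-over-union over classes present in GT or prediction."""
    inter = np.diag(conf_mat).astype(float)
    union = conf_mat.sum(0) + conf_mat.sum(1) - inter
    valid = union > 0
    return (inter[valid] / union[valid]).mean()
```

The confusion matrix is typically accumulated over all evaluation frames before taking the per-class IoU, so rare classes are not unfairly penalized on individual frames.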
We conduct comprehensive ablation studies to examine the effectiveness of our designs; Tables 1 and 2 show the experimental results. Specifically, we trained seven models on the task VIPER → Cityscapes-Seq: 1) Source only, trained with source data only using the supervised learning loss; 2) SA, which performs spatial alignment using an adversarial loss; 3) STA, which performs spatial-temporal alignment using an adversarial loss; 4) JT, which jointly trains SA and STA; 5) C-TCR, which forces STA to focus on cross-domain temporal alignment by introducing the weight discrepancy loss into JT; 6) I-TCR, which performs intra-domain adaptation using the intra-domain temporal consistency regularization loss; and 7) DA-VSN, which integrates C-TCR and I-TCR.
As shown in Table 1, both spatial alignment (SA) and spatial-temporal alignment (STA) outperform 'Source only' consistently, which verifies the effectiveness of alignment in the spatial and temporal spaces. The performance gain of STA is larger than that of SA, which validates that temporal alignment is important in domain adaptive video segmentation, as it guides target predictions to have temporal consistency similar to that of source predictions. Joint training (JT) of STA and SA outperforms STA by only a marginal amount, largely because spatial-temporal alignment already captures spatial alignment. Cross-domain temporal consistency regularization (C-TCR) improves JT clearly by introducing the weight discrepancy loss between the discriminators in STA and SA, which forces STA to focus on alignment in the temporal space; this further validates the significance of temporal alignment in domain adaptive video semantic segmentation. Similar to C-TCR, intra-domain TCR (I-TCR) outperforms 'Source only' by a large margin, as shown in Table 2, which demonstrates the importance of intra-domain adaptation in suppressing temporal inconsistency across target-domain frames. Lastly, DA-VSN produces the best video segmentation, which demonstrates that C-TCR and I-TCR complement each other.
Since few works study domain adaptive video semantic segmentation, we quantitatively compare DA-VSN with multiple domain adaptation baselines [62, 71, 50, 70, 65] that achieve superior performance in domain adaptive image segmentation. We apply these approaches to the domain adaptive video segmentation task by simply replacing their image segmentation models with a video segmentation model and performing domain alignment as in [62, 71, 50, 70, 65]. The comparisons are performed over two synthetic-to-real domain adaptive video segmentation tasks, as shown in Tables 3 and 4. As the two tables show, the proposed method outperforms all the domain adaptation baselines consistently by large margins.
We also perform qualitative comparisons over the video segmentation task VIPER → Cityscapes-Seq. We compare the proposed DA-VSN with the best-performing baseline FDA, as illustrated in Fig. 5. The qualitative results are consistent with the quantitative results in Table 3; our method generates better segmentation with higher temporal consistency across consecutive video frames. The improved segmentation is largely attributed to the proposed temporal consistency regularization, which minimizes the divergence of temporal consistency across domains and across target-domain video frames.
Feature Visualization: Section 4.3 demonstrates that the proposed DA-VSN achieves superior performance in domain adaptive video segmentation compared with multiple baselines. To further study the properties of DA-VSN, we use t-SNE to visualize the distribution of target-domain temporal feature representations from different domain adaptive video segmentation methods, with inter-class and intra-class variances computed for quantitative analysis. As shown in Fig. 6, DA-VSN produces the most discriminative target-domain temporal features, with the largest inter-class variance and the smallest intra-class variance, compared with 'Source only' and FDA.
Complementary Studies: We also investigate whether the proposed DA-VSN complements multiple domain adaptation baselines [71, 50, 70, 65] (as described in Section 4.3) on the domain adaptive video segmentation task. To conduct this experiment, we integrate our proposed temporal consistency regularization components (DA-VSN) into these baselines; Table 5 shows the segmentation results of the newly trained models. Incorporating DA-VSN improves video segmentation performance greatly across all the baselines, which shows that DA-VSN is complementary to domain adaptation methods that minimize domain discrepancy via image translation (e.g., FDA), adversarial learning (e.g., AdvEnt) and self-training (e.g., CBST and CRST).
Different Video Segmentation Architectures: We further study whether DA-VSN works well with different video semantic segmentation architectures. Three widely adopted video segmentation architectures (i.e., Netwarp, TDNet and ESVS) are used in this experiment. As shown in Table 6, the proposed DA-VSN outperforms 'Source only' consistently by large margins. This experiment shows that our method performs well with different video semantic segmentation architectures that exploit temporal relations via feature propagation, attention propagation, and temporal consistency constraints.
This paper presents a domain adaptive video segmentation network that introduces cross-domain temporal consistency regularization (TCR) and intra-domain TCR to address domain shift in videos. Specifically, cross-domain TCR performs spatial and temporal alignment that guides the target video predictions to have similar temporal consistency as the source video predictions. Intra-domain TCR directly minimizes the discrepancy of temporal consistency across different target video frames. Extensive experiments demonstrate the superiority of our method in domain adaptive video segmentation. In the future, we will adapt the idea of temporal consistency regularization to other video domain adaptation tasks such as video instance segmentation and video panoptic segmentation.