Video semantic segmentation, an important research topic for applications such as robotics and autonomous driving, remains largely unsolved. Current video segmentation methods face two main challenges: inefficiency and a lack of labeled data. On the one hand, since frame-by-frame labeling of video is time-consuming, existing datasets contain only one annotated frame per snippet, which makes the problem more challenging. On the other hand, to incorporate the temporal information of the video, existing methods deploy feature aggregation modules to improve segmentation accuracy, which leads to inefficiency during the inference phase.
Optical flow, which encodes the temporal consistency across video frames, has been used to improve segmentation accuracy or to speed up segmentation. For example, the methods in [20, 34, 27] reuse the features of previous frames to accelerate computation. However, doing so decreases segmentation accuracy, and such methods are not considered in this paper. On the other hand, the methods in [7, 16, 8, 23, 11] model multiple frames by flow-guided feature aggregation or a sequence module for better segmentation performance, which increases the computational cost. Our motivation is to use optical flow to exploit temporal consistency in the semantic feature space for training better models, at no extra cost in inference time.
Current video segmentation datasets such as Cityscapes [5] annotate only a small fraction of the frames in each video. Existing methods focus on combining the features of consecutive frames to achieve better segmentation performance, so they can use only a small portion of the frames in the video. Moreover, additional data is needed to train the feature aggregation module (FlowNet) in flow-guided methods.
To address the two challenges of video semantic segmentation, we propose a joint framework for semantic segmentation and optical flow estimation that fully utilizes the unlabeled video data and overcomes the problem of pre-computing optical flow. Semantic segmentation introduces semantic information that helps identify occlusion for more robust optical flow estimation. Meanwhile, non-occluded optical flow provides accurate pixel-level correspondences to guarantee the temporal consistency of the segmentation. These two tasks are related through temporal and spatial consistency in the designed network. Therefore, our model benefits from learning from all the frames in the video without feature aggregation, which means that there is no extra calculation in inference. To the best of our knowledge, this is the first framework that jointly learns these two tasks in an end-to-end manner.
We summarize our contributions as follows: (1) We design a novel framework for joint learning of video semantic segmentation and optical flow estimation with no extra calculation in inference. All the video frames can be used for training with the proposed temporally consistent constraint. (2) We design novel loss functions that handle flow occlusion in both tasks, which improves the training robustness. (3) Our model makes video semantic segmentation and optical flow estimation mutually beneficial and is superior to existing methods under the same setting in both tasks.
Video Segmentation. Compared with image semantic segmentation, video semantic segmentation additionally considers the temporal consistency of consecutive frames. Existing methods mainly fall into two categories. The first category aims to accelerate computation by reusing the features of previous frames. Shelhamer et al. proposed a Clockwork network [27] that adapts a multi-stage FCN and directly reuses the second- or third-stage features of preceding frames to save computation. [34] presented Deep Feature Flow, which propagates high-level features from the key frame to the current frame by optical flow learned in FlowNet [6]. [20] proposed a network that uses spatially variant convolution to propagate features adaptively and an adaptive scheduler to ensure low latency. However, doing so decreases accuracy, and such methods are not considered in this paper.
Another category focuses on improving segmentation accuracy by flow-guided feature aggregation or a sequence module. Our model falls into this category. [7] proposed to combine the CNN features of consecutive frames through a spatial-temporal LSTM module. [8] proposed a NetWarp module that combines the features warped from previous frames by optical flow with those of the current frame to predict the segmentation. [23] proposed gated recurrent units to propagate semantic labels. [16] proposed to learn from unlabeled video data in an unsupervised way through a predictive feature learning model (PEARL). However, such methods require additional feature aggregation modules, such as flow warping modules and sequence modules, which greatly increase the computational cost during the inference phase. Moreover, the feature aggregation modules of these methods can only process the annotated frame and several frames around it, while the rest of the frames are largely discarded in training. In contrast, our method has two parallel branches for semantic segmentation and optical flow estimation, which reinforce each other in training but add no extra calculation in inference. Furthermore, with our temporally consistent constraint, we can leverage all video frames to train our model.
There are also other video segmentation methods with different settings. [17] applied a dense random field over an optimized feature space for video segmentation. [2] introduced a densely connected spatio-temporal graph on deep Gaussian Conditional Random Fields. [11] estimates optical flow and temporally consistent semantic segmentation based on an 8-DoF piecewise-parametric model with a superpixelization of the scene. However, this iterative superpixel-based method can neither benefit from unlabeled data nor be optimized end-to-end. Our model can benefit from unlabeled data and be trained end-to-end as a deep network, making the two tasks mutually beneficial. [4] proposed to learn video object segmentation and optical flow in a multi-task framework, which focuses on segmenting instance-level object masks. Both optical flow and object segmentation are learned in a supervised manner. In comparison, our task is semantic segmentation of the entire image and our optical flow is learned in an unsupervised manner. The two tasks cannot be directly compared.
[6, 12] directly compute dense flow predictions for every pixel through fully convolutional neural networks. PWC-Net [28] uses the current optical flow estimate to warp the CNN features of the second image. [24] introduced a spatio-temporal video autoencoder based on an end-to-end architecture that allows unsupervised training for motion prediction. [15, 22, 25] utilize Spatial Transformer Networks [13] to warp the current images and measure photometric constancy. [29, 14] model occlusion explicitly during the unsupervised learning of optical flow. In this work, the occlusion mask is refined by introducing semantic information. Moreover, the unsupervised optical flow estimation framework can be further extended to estimate monocular depth, optical flow and ego-motion simultaneously in an end-to-end manner [31]. [26] proposed a cascaded classification framework that accurately models 3D scenes by iteratively refining semantic segmentation masks, stereo correspondences, 3D rigid motion estimates, and optical flow fields.
Our framework, the EFC (Every Frame Counts) model, learns video semantic segmentation and optical flow estimation simultaneously in an end-to-end manner. In the following, we first give an overview of our framework and then describe each of its components in detail.
An overview of our EFC model is illustrated in Figure 2. The input to our model is a pair of images randomly selected from nearby frames of a video. If either image has semantic labels, we can update the weights of the network with supervised constraints from the semantic labels as well as unsupervised constraints from the correspondence between the nearby frames. This propagates semantic information across frames and jointly optimizes the semantic component and the optical flow component so that they reinforce each other. Otherwise, only the unsupervised consistency information can be utilized, and our network still benefits from the improvement in the optical flow component.
Specifically, our network consists of three parts: the shared encoder, the segmentation decoder and the flow decoder. The shared encoder contains layers 1-3 of ResNet [10]. Sharing is helpful because semantic and flow information are exchanged within the representation, which increases the representation ability compared to [33]. The segmentation decoder is adopted from layer 4 of ResNet and is used when semantic labels exist. The flow decoder combines the intermediate features of the two frames via a correlation layer, following [12], to predict optical flow. A smoothness loss on the flow result is applied to improve flow quality.
To enable end-to-end cross-frame training without optical flow labels, we design a temporal consistency module. It warps both the input image pairs and the intermediate feature pairs via the predicted flow and uses the warping errors as the photometric loss and the temporal consistency loss, respectively. To further increase robustness under heavy occlusion, where the predicted optical flow is invalid, we introduce an occlusion handling module with an occlusion-aware loss. The occlusion mask is also learned end-to-end and improves as the predicted optical flow improves. In the following, we introduce each module of our model in detail.
Temporally Consistent Constraint
Photometric consistency is commonly adopted in optical flow estimation, where the first frame is warped to the next by optical flow and the warping loss is used to train the network. In this work, we generalize the photometric loss to the feature domain. As convolutional neural networks are translation equivariant, the feature maps of adjacent frames should also follow the temporally consistent constraint.
More specifically, for a pair of video frames, we feed them into the shared encoder network to extract their feature maps. Since we learn both the forward and the backward optical flow simultaneously, we then warp the feature map of one frame to the other frame by the estimated flow, so that the warped feature map is expected to be consistent with the feature map of the other frame. We adopt differentiable bilinear interpolation for warping. Note that the warping direction differs from the flow direction. However, the flow can be invalid in occluded regions, so we estimate occlusion maps for both frames by checking whether each pixel has a corresponding pixel in the adjacent frame. With the occlusion maps, we avoid penalizing pixels in occluded regions. The temporal consistency loss thus penalizes, at every non-occluded feature location, the difference between a feature map and its warped counterpart. Notice that we apply the warping constraint in both directions for training.
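The warping and the occlusion-masked consistency term described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names, the `(dx, dy)` flow layout, the border clamping, and the squared L2 feature distance are choices made for this sketch and are not taken from the paper.

```python
import numpy as np

def warp_bilinear(feat, flow):
    """Backward-warp feat (H, W, C) with flow (H, W, 2) holding (dx, dy).

    Output pixel (y, x) samples feat at (y + dy, x + dx) with bilinear
    weights. Sample positions outside the image are clamped to the border
    (an assumption of this sketch).
    """
    h, w, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom

def temporal_consistency_loss(feat_a, feat_b, flow_ab, occ_a):
    """Masked feature difference between frame a and frame b warped to a.

    occ_a is (H, W) with 1 marking occluded pixels, which are not penalized.
    The squared L2 distance is an assumed choice of penalty.
    """
    warped_b = warp_bilinear(feat_b, flow_ab)
    valid = 1.0 - occ_a
    err = ((feat_a - warped_b) ** 2).sum(axis=-1)
    return (valid * err).sum() / max(valid.sum(), 1.0)
```

In a bidirectional setup, the same function would be called a second time with the roles of the two frames swapped and the reverse flow, and the two terms would be added.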
The temporal consistency loss introduces a temporal regularization on the feature space, thus allowing our model to be trained with unlabeled video data. When the label is unavailable, our model can still benefit from the temporal consistency constraint.
Our model learns occlusion in a self-supervised manner. Occlusion is used here as a general term: it refers to pixels that are photometrically inconsistent between the two given frames, which can be caused by real occlusion by objects, by pixels moving in and out of the image, by changes of view angle, and so on. The occlusion estimation network and the optical flow estimation network share most of their parameters. For each block in the non-occluded flow branch, we add two convolutional layers with very few channels and a sigmoid layer for occlusion estimation. Using the backward optical flow, we can compute the pixel-level correspondence between the two frames. We decompose the optical flow into its vertical and horizontal components, which give the position in the adjacent frame that each pixel corresponds to.
The occlusion mask for the backward flow marks a pixel as non-occluded if it has a corresponding pixel in the adjacent frame, and as occluded otherwise. Cross entropy with a penalty term is then used for occlusion estimation. The network mimics this mask and produces finer masks through our occlusion loss.
Since we do not calculate the consistency loss in occluded regions, the network tends to predict larger occluded regions, so the penalty term is used to prevent excessive occlusion prediction. The larger the penalty weight, the greater the penalty on occluded regions and the smaller the predicted occluded region. We tried different values between 0 and 1 and found that 0.2 performs best.
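A minimal sketch of this occlusion supervision is given below. It rests on assumptions that the text does not fix: a pixel is treated as having no correspondence when its flow target falls outside the image bounds, the penalty term is taken to be the mean predicted occlusion probability, and the names `occlusion_target` and `occlusion_loss` are hypothetical. Only the penalty weight of 0.2 comes from the text.

```python
import numpy as np

def occlusion_target(flow):
    """Return a (H, W) mask with 1 where a pixel has no in-bounds correspondence.

    flow is (H, W, 2) holding (dx, dy). Using an in-bounds check as the
    correspondence test is an assumption of this sketch.
    """
    h, w, _ = flow.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x2 = xs + flow[..., 0]
    y2 = ys + flow[..., 1]
    inside = (x2 >= 0) & (x2 <= w - 1) & (y2 >= 0) & (y2 <= h - 1)
    return (~inside).astype(np.float64)

def occlusion_loss(pred, target, penalty=0.2, eps=1e-7):
    """Binary cross entropy plus a penalty on the predicted occlusion area.

    pred holds occlusion probabilities in [0, 1] (the sigmoid output).
    The penalty form (mean predicted probability) is assumed.
    """
    p = np.clip(pred, eps, 1 - eps)
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()
    return bce + penalty * p.mean()
```

Raising `penalty` makes every predicted occluded pixel more expensive, which matches the described effect of shrinking the predicted occlusion region.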
Optical Flow Estimation
Similar to [31, 15, 29], optical flow can be learned in a self-supervised manner. More specifically, the first frame can be warped to the next frame by the predicted optical flow, and photometric consistency and motion smoothness are exploited for training. Photometric consistency serves to reconstruct the scene structure between two frames, while motion smoothness filters out erroneous predictions and preserves sharp details. In this work, we observe that semantic information can be leveraged through joint training to help the estimation of optical flow.
As shown in Figure 2, the semantic maps introduce semantic information on the likely physical motion of the associated pixels. Besides, we generate error masks that point out the inaccurate regions of the optical flow for more robust optical flow estimation. As illustrated in Figure 3, we first calculate an inconsistent mask between the segmentation prediction and the warped segmentation prediction, which is obtained with bilinear interpolation, and then define the error mask from it.
The inconsistent mask of the two segmentation maps should contain the occlusion mask and the offset caused by inaccurate optical flow. To unify these two masks, we simply double the weight of the error mask region and ignore the occlusion mask region during optical flow learning.
The photometric loss compares the original image with the warped image through the per-pixel structural similarity index measurement (SSIM) [30] and is weighted by a loss map, which indicates how strongly each location is penalized. Here we adopt a linear combination of two common metrics for estimating the similarity of the original image and the warped one. Intuitively, perfectly matched pixels indicate that the estimated flow is correct and are penalized less in the photometric loss. The balancing weight of the combination is set to 0.85 as in [31]. Following [15, 31], the smoothness loss is defined through the vector differential operator applied to the predicted flow. Note that both the photometric and the smoothness losses are calculated on multi-scale blocks and in both directions.
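The following NumPy sketch illustrates one way such a weighted photometric loss and a smoothness loss could look. It is not the paper's exact formulation: the combination `alpha * (1 - SSIM) / 2 + (1 - alpha) * |I - I'|` with a 3x3 SSIM window is the form commonly used in unsupervised flow and depth work and is assumed here, the smoothness term is a plain first-order gradient penalty without edge weighting, and all function names are hypothetical. The weights 2 (error region), 0 (occluded region) and the value 0.85 follow the text.

```python
import numpy as np

def box3(img):
    """3x3 mean filter with edge padding for a (H, W) array."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def ssim_map(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM of two grayscale images with values in [0, 1]."""
    mu_a, mu_b = box3(a), box3(b)
    var_a = box3(a * a) - mu_a ** 2
    var_b = box3(b * b) - mu_b ** 2
    cov = box3(a * b) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def loss_weight_map(occ, err):
    """Weight 1 by default, 2 in the error region, 0 in the occluded region."""
    w = np.ones(occ.shape, dtype=np.float64)
    w[err > 0] = 2.0
    w[occ > 0] = 0.0  # occlusion takes precedence over the error region
    return w

def photometric_loss(img, warped, weights, alpha=0.85):
    """Weighted mix of an SSIM term and an L1 term (assumed form)."""
    per_pixel = alpha * (1 - ssim_map(img, warped)) / 2
    per_pixel = per_pixel + (1 - alpha) * np.abs(img - warped)
    return (weights * per_pixel).mean()

def smoothness_loss(flow):
    """Mean absolute first-order gradient of a (H, W, 2) flow field."""
    dx = np.abs(flow[:, 1:] - flow[:, :-1]).mean()
    dy = np.abs(flow[1:] - flow[:-1]).mean()
    return dx + dy
```

A multi-scale, bidirectional version would evaluate both functions at each scale of the flow decoder and for both flow directions, then sum the results.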
For the frames that have ground-truth labels, we use the standard log-likelihood loss for semantic segmentation.
To summarize, the final loss for the entire framework is a weighted combination of the losses introduced above, where the weights balance the individual losses. Our entire framework is thus trained end-to-end.
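As an illustration, the per-pixel log-likelihood loss and a weighted combination of loss terms might be written as below. The `ignore_index` value of 255, the dictionary keys, and the function names are assumptions made for this sketch; the actual loss weights used in the paper are not recoverable from the text and are therefore not shown.

```python
import numpy as np

def segmentation_nll(logits, labels, ignore_index=255):
    """Mean negative log-likelihood over labeled pixels.

    logits: (H, W, K) class scores; labels: (H, W) integer class ids.
    Pixels whose label equals ignore_index do not contribute.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    valid = labels != ignore_index
    safe = np.where(valid, labels, 0)
    picked = np.take_along_axis(log_prob, safe[..., None], axis=-1)[..., 0]
    return float(-(picked * valid).sum() / max(valid.sum(), 1))

def total_loss(losses, weights):
    """Weighted sum of named loss terms (the names are hypothetical)."""
    return sum(weights[name] * value for name, value in losses.items())
```

For an unlabeled frame pair, the segmentation term would simply be left out of `losses`, so only the unsupervised terms contribute.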
Dataset and Setting
Datasets We evaluate our framework for video semantic segmentation on the Cityscapes [5] and CamVid [1] datasets. We also report our competitive results for optical flow estimation on the KITTI dataset [9].
Cityscapes [5] contains 5,000 sparsely labeled snippets collected from 50 cities in different seasons, which are divided into 2,975, 500, and 1,525 snippets for training, validation and testing, respectively. Each snippet contains 30 frames, and only the 20th frame is finely annotated at the pixel level. 20,000 coarsely annotated images are also provided.
CamVid [1] is the first collection of videos with object class semantic labels. It contains 701 color images with annotations of 11 semantic classes. We follow the same split as in [17, 23], with 367 training images, 101 validation images and 233 test images.
KITTI [9] is a real-world computer vision benchmark dataset with multiple tasks. The training data we use here is similar to that of [31], where the official training images are adopted as the testing set. All the related images in the 28 scenes covered by the testing data are excluded. Since there are no segmentation labels for our training set, we generate coarse segmentation results as the segmentation ground truths with a model trained on Cityscapes.
Evaluation Metrics We report mean Intersection-over-Union (mIoU) scores for semantic segmentation task on Cityscapes and CamVid datasets. The optical flow performance for the KITTI dataset is measured by the average end-point-error (EPE) score.
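For reference, the two metrics can be sketched as follows. This is a simplified single-image version with hypothetical function names: benchmark mIoU is normally accumulated over the whole evaluation set before averaging over classes, and the `ignore_index` of 255 is an assumption.

```python
import numpy as np

def mean_iou(pred, label, num_classes, ignore_index=255):
    """Mean Intersection-over-Union over classes present in pred or label."""
    valid = label != ignore_index
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (label == c) & valid
        union = (p | g).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append((p & g).sum() / union)
    return float(np.mean(ious))

def end_point_error(flow_pred, flow_gt, valid=None):
    """Average Euclidean distance between predicted and true flow vectors.

    flow arrays are (H, W, 2); valid is an optional (H, W) mask of pixels
    that have ground truth.
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    if valid is None:
        return float(err.mean())
    return float(err[valid > 0].mean())
```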
Implementation Details Our framework is not limited to specific CNN architectures. In our experiments, we use the original PSPNet [33] and a modified FlowNetS [6] as the baseline networks unless otherwise specified. FlowNetS is modified as follows: (1) it shares the encoder with PSPNet; (2) two convolution layers, with 32 and 1 channels respectively, are added for occlusion estimation. The loss weights are kept the same for all experiments.
During training, we randomly choose ten pairs of images from one snippet, five of which contain images with ground truths. The training images are randomly cropped to a fixed size. We also perform random scaling, rotation, flipping and color augmentations. The network is optimized by SGD, with momentum and weight decay set to 0.9 and 0.0001, respectively. We use a mini-batch size of 16 on 16 TITAN Xp GPUs with synchronized Batch Normalization. We use the 'poly' learning rate policy with the base learning rate set to 0.01 and the power set to 0.9, as in [33]. The number of training iterations is set to 120K.
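The 'poly' policy is commonly defined as the base rate multiplied by (1 - iter / max_iter) raised to the given power. A small sketch with the values stated above (the function name is hypothetical):

```python
def poly_lr(iteration, base_lr=0.01, max_iter=120_000, power=0.9):
    """Learning rate under the 'poly' schedule at a given iteration."""
    return base_lr * (1.0 - iteration / max_iter) ** power
```

With these settings the rate starts at 0.01, is roughly 0.0054 halfway through training, and decays to 0 at the final iteration.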
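Table 1: Ablation study for video semantic segmentation on the Cityscapes validation set (mIoU, %).
|Method||mIoU|
|ResNet50 + PSPNet||76.20|
|+ temporally consistent constraint + OM (single pair)||77.79|
|+ temporally consistent constraint + OM (five pairs)||78.07|
|+ temporally consistent constraint + OM + UD||78.44|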
To further evaluate the effectiveness of the proposed components, i.e., the joint learning, the temporally consistent constraint, the occlusion masks, and the unlabeled data, we conduct ablation studies on both the segmentation and optical flow tasks. All experiments use the same training setting.
For video segmentation, we compare five simplified versions on the Cityscapes validation set: (1) the temporally consistent constraint on a single pair of images with a fixed pre-trained FlowNetS; (2) the temporally consistent constraint without the occlusion mask on a single pair of images; (3) the temporally consistent constraint with the occlusion mask (OM) on a single pair of images; (4) the temporally consistent constraint with the occlusion mask on five randomly selected pairs of images; (5) our full EFC model, which additionally uses unlabeled data (UD).
Table 2: Ablation study for optical flow estimation on KITTI (EPE).
|UL + OE||7.23||8.72|
|UL + TC||4.94||8.84|
|UL + OE + TC||4.51||7.79|
Table 3: Comparison with state-of-the-art video semantic segmentation methods on Cityscapes.
|Method||C||IoU cls||IoU cat|
|Dilation10 + GRFP [23]|| ||67.8||86.7|
|Dilation10 + EFC (Ours)|| ||68.7||87.3|
|PSP + EFC (Ours)|| ||80.2||90.9|
|PSP_CRS + NetWarp [8]||✓||80.5||91.0|
|PSP_CRS + GRFP [23]||✓||80.6||90.8|
|PSP_CRS + EFC (Ours)||✓||81.0||91.2|
|+ EFC (Ours)||✓||82.7||92.1|
|+ VPLR [35]||✓||83.5||92.2|
The ablation study results for segmentation are presented in Table 1. It can be seen that: (1) The performance increases steadily as more components are used for video segmentation, showing the contribution of each part. (2) Compared with the fixed FlowNetS, joint learning with optical flow benefits video segmentation, which shows the close relationship between the two tasks. (3) The temporally consistent constraint brings a large improvement (1.3 percentage points) to video segmentation, even without the occlusion mask. (4) The improvement achieved by the occlusion mask shows that modeling occluded regions benefits video segmentation. (5) Using more labeled data pairs and using unlabeled data both clearly improve performance, which provides evidence that our EFC model takes full advantage of the video information.
For optical flow estimation, we compare five versions of our model: (1) UL: unsupervised learning of only the flow branch with the smoothness loss and the photometric loss; (2) UL + OE: adding occlusion estimation without the occlusion mask; (3) UL + TC: adding the segmentation branch and the temporal consistency module; (4) UL + OE + TC: our model without the occlusion handling module; (5) EFC_full: our full model.
From Table 2, we can observe that: (1) Our model can learn in an unsupervised manner using only the optical flow branch. (2) The segmentation branch and the temporally consistent constraint greatly facilitate the learning of optical flow. (3) Better occlusion estimation further improves the performance of optical flow estimation.
Video Semantic Segmentation
We compare our video semantic segmentation model to the state-of-the-art alternatives on the challenging Cityscapes and CamVid datasets.
Cityscapes To validate the robustness of the proposed method across network architectures, we use Dilation10 [32], PSPNet [33] and DeepLabv3+ [3] as the backbone networks for the segmentation branch, respectively. Table 3 shows the quantitative comparison with a number of state-of-the-art video segmentation models.
We observe that: (1) With DeepLabv3+, PSPNet and Dilation10 as our backbones, our model improves the mIoU score by 0.6, 1.8 and 2.1, respectively. Notice that our approach can be applied to any image semantic segmentation model for more accurate semantic segmentation. (2) VPLR [35] is first pre-trained on the Mapillary dataset, which contains 18,000 annotated street-level images for autonomous driving. In contrast, our model benefits from unlabeled data without additional labeling cost. The performance can be further improved when we use coarsely annotated images. (3) Our segmentation model benefits from the spatial-temporal regularization in the feature space, so there is no extra cost during the inference phase. All the other methods require additional modules and computational cost. A qualitative comparison is shown in Figure 4.
CamVid We evaluate our method on the CamVid dataset and compare it with multiple video semantic segmentation methods. The comparative results are given in Table 4. Our model achieves the best result under the same setting.
Table 4: Comparison with video semantic segmentation methods on CamVid (mIoU, %).
|Method||mIoU|
|+ STFCN [7]||65.9|
|+ GRFP [23]||66.1|
|+ FSO [17]||66.1|
|+ VPN [jampani2017video]||66.7|
|+ NetWarp [8]||67.1|
|+ EFC (ours)||67.4|
Table 5: Optical flow estimation results on KITTI (EPE).
|Back2Future [14]||R + K||3.22||6.59|
|SelFlow [21]||S + K||–||4.84|
Optical Flow Estimation
To quantify how optical flow estimation benefits from semantic segmentation, we evaluate the estimated flow on the KITTI dataset. Both supervised and unsupervised methods are included. As shown in Table 5, our model not only outperforms the existing unsupervised learning methods, but also yields results comparable with FlowNet2 [12], which is trained on the FlyingChairs and FlyingThings3D datasets. Following the common practice in [25, 31, 29, 22, 18], we use no additional data and discard a whole sequence if it contains any test frames, while [14, 21] use the RoamingImages dataset and the Sintel dataset for pre-training, respectively. Besides, they use PWC-Net [28] as the base model, which is more powerful than FlowNetS.
In this paper, we propose a novel framework (EFC) for joint estimation of video semantic segmentation and optical flow. We observe that semantic segmentation introduces semantic information and helps model occlusion for more robust optical flow estimation. Meanwhile, non-occluded optical flow provides accurate pixel-level temporal correspondences to guarantee the temporal consistency of the segmentation. Moreover, we address the insufficient data utilization and the inefficiency issues through our framework. Extensive experiments have shown that our approach outperforms the state-of-the-art alternatives under the same settings in both tasks.
Zhiwu Lu is partially supported by National Natural Science Foundation of China (61976220, 61832017, and 61573363), and Beijing Outstanding Young Scientist Program (BJJWZYJH012019100020098). Ping Luo is partially supported by the HKU Seed Funding for Basic Research and SenseTime’s Donation for Basic Research.
[1] (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97.
[2] (2018) Deep spatio-temporal random fields for efficient video segmentation. In CVPR, pp. 8915–8924.
[3] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pp. 801–818.
[4] (2017) Segflow: joint learning for video object segmentation and optical flow. In ICCV, pp. 686–695.
[5] (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223.
[6] (2015) Flownet: learning optical flow with convolutional networks. In ICCV, pp. 2758–2766.
[7] (2016) STFCN: spatio-temporal fcn for semantic video segmentation. arXiv preprint arXiv:1608.05971.
[8] (2017) Semantic video cnns through representation warping. CoRR abs/1708.03088.
[9] (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, pp. 3354–3361.
[10] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[11] (2016) Joint optical flow and temporally consistent semantic segmentation. In ECCV, pp. 163–177.
[12] (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In CVPR, Vol. 2, pp. 6.
[13] (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025.
[14] (2018) Unsupervised learning of multi-frame optical flow with occlusions. In ECCV, pp. 690–706.
[15] (2016) Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In ECCV, pp. 3–10.
[16] (2017) Video scene parsing with predictive feature learning. In ICCV, pp. 5581–5589.
[17] (2016) Feature space optimization for semantic video segmentation. In CVPR, pp. 3168–3175.
[18] (2019) Bridging stereo matching and optical flow via spatiotemporal correspondence. In CVPR, pp. 1890–1899.
[19] (2019) Dfanet: deep feature aggregation for real-time semantic segmentation. In CVPR, pp. 9522–9531.
[20] (2018) Low-latency video semantic segmentation. In CVPR, pp. 5997–6005.
[21] (2019) SelFlow: self-supervised learning of optical flow. In CVPR, pp. 4571–4580.
[22] (2018) UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In AAAI Conference on Artificial Intelligence.
[23] (2018) Semantic video segmentation by gated recurrent flow propagation. In CVPR, pp. 6819–6828.
[24] (2015) Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309.
[25] (2017) Unsupervised deep learning for optical flow estimation. In AAAI Conference on Artificial Intelligence, Vol. 3, pp. 7.
[26] (2017) Cascaded scene flow prediction using semantic segmentation. In International Conference on 3D Vision (3DV), pp. 225–233.
[27] (2016) Clockwork convnets for video semantic segmentation. In ECCV, pp. 852–868.
[28] (2018) Pwc-net: cnns for optical flow using pyramid, warping, and cost volume. In CVPR, pp. 8934–8943.
[29] (2018) Occlusion aware unsupervised learning of optical flow. In CVPR, pp. 4884–4893.
[30] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
[31] (2018) GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In CVPR, Vol. 2.
[32] (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
[33] (2017) Pyramid scene parsing network. In CVPR, pp. 2881–2890.
[34] (2017) Deep feature flow for video recognition. In CVPR, pp. 3.
[35] (2019) Improving semantic segmentation via video propagation and label relaxation. In CVPR, pp. 8856–8865.