Semantic segmentation assigns a semantic class to each pixel in a scene and is currently an active research topic in computer vision. In recent years, image semantic segmentation has achieved unprecedented accuracy, benefiting from the great progress of deep convolutional neural networks (DCNNs) [long2015fully] and various datasets (e.g., Cityscapes [Cordts_2016_CVPR] and CamVid [brostow2009semantic]). However, many real-world applications, e.g., robotics [kostavelis2015semantic], autonomous driving [teichmann2018multinet], and video surveillance [Liu_2017_CVPR], have a strong demand for fast and accurate video semantic segmentation. Compared to images, videos consist of consecutive frames and involve a much larger volume of data, so more efficient algorithms are generally required for video semantic segmentation.
A naive approach to video segmentation is to apply an image segmentation model to every frame independently. Such a deployment is unacceptable in practice because the computation burden is too heavy. In fact, consecutive frames of a video usually share a large portion of their content, so it is unnecessary to reprocess every pixel of a frame with the image segmentation model [xu2018dynamic]. An intuitive idea for handling subsequent video frames is therefore to reuse the features extracted from previous frames when segmenting the current frame [zhu2017deep]. Feature propagation can thus be adopted to reduce the overall computational complexity.
In recent works, several feature propagation based methods have been proposed for video semantic segmentation, e.g., DFF [zhu2017deep], NetWarp [gadde2017semantic], DVSNet [xu2018dynamic], and Accel [jain2019accel]. These methods first compute the optical flow between the key frame and the current frame, and then produce the features of the current frame by propagating the features of the key frame under the guidance of the optical flow. Bilinear interpolation is usually used as the feature warping operator. CNN-based flow estimation methods (e.g., FlowNet [dosovitskiy2015flownet, ilg2017flownet], FlowNet2.0 [ilg2017flownet]) are commonly preferred since they are easy to embed into video segmentation frameworks with end-to-end training. Evidently, the accuracy of optical flow estimation largely determines the performance of feature propagation.
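To make the warping step concrete, the following is a minimal pure-Python sketch of flow-guided feature propagation with bilinear interpolation on a single feature channel. It is an illustration under stated assumptions, not the implementation used by any of the cited methods: the names `bilinear_sample` and `warp` are hypothetical, the flow is assumed to point from each current-frame pixel to its source location in the key frame, and out-of-range coordinates are simply clamped to the border.

```python
import math


def bilinear_sample(channel, x, y):
    """Sample a 2D grid (list of rows) at a real-valued (x, y), clamping to the border."""
    h, w = len(channel), len(channel[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * channel[y0][x0] + ax * channel[y0][x1]
    bottom = (1 - ax) * channel[y1][x0] + ax * channel[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp(key_feature, flow):
    """Backward warping: out[y][x] is key_feature sampled at (x + dx, y + dy).

    flow[y][x] is a (dx, dy) pair (assumed convention: current frame -> key frame).
    """
    h, w = len(key_feature), len(key_feature[0])
    return [
        [bilinear_sample(key_feature, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]
```

A multi-channel feature map would apply the same sampling to every channel. Because each output value is a weighted mix of up to four source values, a flow error of even a fraction of a pixel changes the result, which is why thin structures are especially sensitive to flow inaccuracy.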
Despite decades of development, accurate optical flow estimation remains an open and challenging problem [liu2019selflow]. In particular, the occlusion caused by scene motion makes optical flow estimation ill-posed, since no visual correspondences exist for the occluded pixels [neoral2018continual]. When inaccurate optical flow is used in feature propagation, the produced features become distorted and lead to incorrect segmentation results. In addition, for small or slender areas of one semantic class (e.g., pedestrian, pole), a slight offset in the predicted optical flow causes noticeable distortion, which is especially serious for long-distance propagation. We show typical distortion phenomena in Figure 1. In this work, we focus on correcting distorted features rather than pursuing more accurate optical flow estimation.
Some works have noticed the distortion problem and propose to correct the propagated features [zhu2017deep, jain2019accel]. However, these works correct propagated features globally without differentiating between areas, which can undermine the accurate parts of the propagated features. Large portions of the propagated features are flat areas of one semantic class (e.g., sky, road), which remain accurate after long-distance propagation and are even more accurate than the correction cues obtained under a limited computation budget. We show typical failure cases in Figure 2. Besides, we count how many pixels are misrectified and how many are correctly rectified, as shown in Figure 3. The misrectified areas are not negligible compared to the correctly rectified areas. Thus, we need to prevent excessive feature correction and reuse propagated features as much as possible.
In this work, we propose distortion-aware feature correction for effective rectification of propagated features. In order to correct distorted areas and reuse the others as much as possible, we need to locate the distorted areas first. A naive approach is to extract features on the current frame via the same segmentation model and compare them with the propagated ones, since distorted areas lie where the two features are misaligned. However, applying the segmentation model to the current frame is unacceptable in our setting. To tackle this problem, we propose to transfer distortion patterns from feature space into image space. We observe that if we propagate frame images via the same optical flow used in feature propagation, the same distortion phenomenon occurs in the propagated frames. We show the typical distortion pattern transfer phenomenon in Figure 4. Building on this distortion pattern transfer strategy, we propose an extremely light-weight model, called DMNet, to predict distortion maps by comparing the distorted frame and the current frame.
Based on the predicted distortion maps, we design a feature correction module (FCM) to extract correction cues from the current frame and perform effective distortion correction on the propagated features. Specifically, FCM utilizes a dedicated light-weight model, called CFNet, to extract correction cues. We introduce the distortion map into the training of CFNet, which guides CFNet to focus on easily distorted areas. Besides, FCM rectifies the propagated features with the extracted correction cues under the guidance of distortion maps, which not only corrects the content in distorted areas but also reuses the content in other areas as much as possible. Experiments show that FCM significantly boosts segmentation performance at a low cost, especially for long-distance feature propagation.
The contributions of this work are summarized as follows:
We propose an effective strategy to transfer distortion patterns from feature space into image space and design DMNet for accurate distortion map prediction.
We propose a novel feature correction module (FCM) that achieves feature propagation with both high accuracy and low cost. Propagated features are rectified in the distorted areas but reused as much as possible in other areas.
We experimentally verify the effectiveness of our proposed FCM, and the results on Cityscapes and CamVid demonstrate the superiority of our method over previous state-of-the-art methods.
II Related Work
II-A Image Semantic Segmentation
Benefiting from the rapid development of DCNNs [iandola2016squeezenet, simonyan2014very, he2016deep, szegedy2015going, huang2017densely], more and more semantic segmentation networks have emerged. Specifically, the fully convolutional network (FCN) [long2015fully] first proposed replacing fully-connected layers with convolutional layers, which leads to better performance. Inspired by FCN, many extensions [zhao2017pyramid, wu2019wider, lin2017refinenet] have been proposed, which together advance image semantic segmentation. Dilated layers [chen2018deeplab, yu2015multi] have also been introduced to replace pooling layers, which better balances computation cost and receptive field size. In addition, [chen2018deeplab, chen2014semantic, zheng2015conditional] propose to use the conditional random field (CRF) to refine the results of image segmentation. Recently, spatial pyramid pooling [he2015spatial] and atrous spatial pyramid pooling (ASPP) [chen2017rethinking, chen2018deeplab] have been used in PSPNet [zhao2017pyramid] and DeepLab [chen2018deeplab], respectively, to capture multi-scale contextual information. The great progress of image semantic segmentation provides the fundamental component for video semantic segmentation.
II-B Optical Flow Estimation
Optical flow is a representative pattern describing the apparent motion of objects in a video. Optical flow estimation is a fundamental task in video analysis with a long history. Classical variational approaches model optical flow estimation as an energy minimization problem [horn1981determining]. Such methods are effective for small motion, but tend to fail when displacements are large. Recent works use convolutional neural networks (CNNs) to improve sparse matching by learning an effective feature embedding [dosovitskiy2015flownet, ilg2017flownet, sun2018pwc].
Although current methods can obtain satisfactory optical flow in most common cases, calculating accurate optical flow for occluded areas is still an open problem. Most methods detect occlusions by a consistency check between the estimated forward and backward optical flow [chen2016full, sundaram2010dense] and then extrapolate the flow into the occluded areas. However, the optical flow used for this check is itself already adversely affected by the occlusions. Evidently, features propagated under the guidance of inaccurate flow are severely distorted, especially in occluded areas.
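For illustration, a forward-backward consistency check can be sketched as follows. This is a simplified version under our own assumptions rather than the exact criterion of the cited works: the function name `occlusion_mask` is hypothetical, the backward flow is looked up at the nearest pixel instead of being interpolated, and a fixed squared-error threshold is used.

```python
def occlusion_mask(forward, backward, threshold=1.0):
    """Flag pixels whose forward and backward flows disagree.

    forward[y][x] and backward[y][x] are (dx, dy) pairs. For a visible pixel p,
    following the forward flow and then the backward flow should return to p,
    i.e. forward(p) + backward(p + forward(p)) should be close to zero.
    """
    h, w = len(forward), len(forward[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            fx, fy = forward[y][x]
            tx, ty = round(x + fx), round(y + fy)  # nearest-pixel lookup (assumption)
            if not (0 <= tx < w and 0 <= ty < h):
                mask[y][x] = True  # the pixel leaves the image: no correspondence
                continue
            bx, by = backward[ty][tx]
            error = (fx + bx) ** 2 + (fy + by) ** 2
            mask[y][x] = error > threshold
    return mask
```

Note that the check relies on the very flow fields it is meant to validate, which is the weakness pointed out above.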
In practice, most video segmentation methods prefer current state-of-the-art CNN-based flow networks [dosovitskiy2015flownet, ilg2017flownet, sun2018pwc, ranjan2017optical] because they are easily embedded with end-to-end training. However, these methods do not explicitly deal with occlusions, and consequently video segmentation suffers from severe feature distortion. Thus, dealing with occluded areas efficiently and effectively when utilizing optical flow in feature propagation is crucial for segmentation accuracy. In this work, we propose FCM to alleviate the distortion phenomenon by explicitly rectifying propagated features.
II-C Video Semantic Segmentation
Different from static images, videos contain useful temporal information to exploit. Many previous works therefore focus on modeling cross-frame relations to improve semantic segmentation accuracy. STFCN [fayyaz2016stfcn] utilizes a spatial-temporal LSTM over per-frame CNN features. Nilsson and Sminchisescu [nilsson2018semantic] proposed to use gated recurrent units to propagate semantic labels. Gadde et al. [gadde2017semantic] proposed to fuse the features warped from the key frame with those from the current frame. V2V [tran2016deep] utilizes a 3D CNN to perform voxel-level prediction.
On the other hand, many works focus on reducing the computational cost of video semantic segmentation. Clockwork Net [shelhamer2016clockwork] updates different levels of feature maps at different frequencies. DFF [zhu2017deep] first employs an optical flow network to predict motion information, and then propagates the high-level features from the key frames to other frames. DVSNet [xu2018dynamic] builds a decision model to dynamically choose the key frames, which achieves a better balance between quality and efficiency. Li et al. [li2018low] propose spatially variant convolution to adaptively fuse the features over time. Accel [jain2019accel] proposes a reference branch to extract high-quality segmentation from key frames and an update branch to efficiently extract low-quality segmentation from other frames, and then fuses them to improve the segmentation accuracy. TDNet [hu2020temporally] proposes to distribute several sub-networks over sequential frames and then recompose the extracted features for segmentation via an attention propagation module.
Among the feature propagation based video segmentation methods, DFF [zhu2017deep] and Accel [jain2019accel] are the most closely related to our proposed FCM. DFF [zhu2017deep] proposes a scale field to capture error-prone areas by comparing the key and current frames and then rectifies propagated features via an element-wise multiplication. Accel [jain2019accel] utilizes a light-weight model to segment the current frame and then fuses the result with the propagated one via a convolution operation. These works all conduct global feature correction, rectifying not only distorted but also accurate areas. We provide visualizations and experimental results showing that global feature correction deteriorates propagated features. In contrast, our proposed FCM conducts feature correction only in distorted areas under the guidance of predicted distortion maps, and its effectiveness is experimentally verified.
III Our Approach
In this work, we focus on boosting the segmentation accuracy on non-key frames under the optical flow based feature propagation framework. Due to inherent limitations (e.g., the occlusion problem), optical flow is always inaccurate to some degree, which results in distorted features, especially after long-distance propagation. To address this issue, we propose distortion-aware feature correction in this paper. For such a task, we need to determine 1) how to locate distorted areas, 2) how to extract correction cues, and 3) how to conduct feature correction. In the following, we first introduce the framework of our proposed method. Then we elaborate on our proposed strategy for distortion pattern transfer and detection, and on the feature correction module (FCM). Finally, we provide the details of network training.
Following the common flow-based feature propagation paradigm, we design our video semantic segmentation framework and propose a feature correction module to tackle the feature distortion problem, as shown in Figure 5.
To be specific, each video frame is processed as either a key or a non-key frame. For key frames, we perform image semantic segmentation directly to get the results, and then propagate the intermediate features to the subsequent non-key frames. In our method, the features are propagated in a frame-by-frame way. That is, the features of the current frame are obtained by propagating the uncorrected features of the previous frame, where the predicted optical flow is used as guidance and bilinear interpolation is adopted as the warping operator. Besides, we maintain a distorted frame image, which is propagated via the same optical flow as the features. For non-key frames, we first predict a distortion map to locate distorted areas in the propagated features, by passing the distorted frame and the current frame into our light-weight DMNet. After that, we pass the propagated features, the predicted distortion map, and the current frame image into the feature correction module (FCM) to rectify feature distortion. Finally, we conduct semantic segmentation on the corrected features to get the segmentation result.
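The control flow described above can be sketched as follows. All callables passed to `segment_video` are hypothetical placeholders for the corresponding components (segmentation feature extractor, flow network, warping operators, DMNet, FCM, and the final classifier), and the fixed key-frame interval is an assumption made only for this sketch.

```python
def segment_video(frames, key_interval, extract, estimate_flow,
                  warp_feature, warp_image, dmnet, fcm, classify):
    """Sketch of key / non-key frame processing with frame-by-frame propagation."""
    results = []
    feature = distorted = previous = None
    for index, frame in enumerate(frames):
        if index % key_interval == 0:
            # Key frame: run the full segmentation model and reset the propagated state.
            feature = extract(frame)
            distorted = frame
            results.append(classify(feature))
        else:
            flow = estimate_flow(previous, frame)
            # The uncorrected feature keeps propagating; the image follows the same flow.
            feature = warp_feature(feature, flow)
            distorted = warp_image(distorted, flow)
            distortion_map = dmnet(distorted, frame)
            corrected = fcm(feature, distortion_map, frame)
            results.append(classify(corrected))
        previous = frame
    return results
```

Keeping the propagated feature and the distorted image in lockstep is what lets the image-space comparison reflect the distortion accumulated in feature space.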
In our implementation, we adopt DeepLab-v3+ [chen2018encoder] as the image semantic segmentation model due to its strong performance in both accuracy and efficiency. We use the features after the classifier of the segmentation model for propagation, which is a common choice in other works such as DFF [zhu2017deep] and Accel [jain2019accel]. In addition, a modified FlowNet2-S [ilg2017flownet] is used for optical flow estimation.
III-B Distortion Map Prediction
In order to conduct effective but not excessive feature correction, distorted areas need to be located first. In this work, we propose to transfer distortion patterns from the distorted features into image space by propagating the frame image via the same optical flow used in feature propagation. Intuitively, distorted areas lie where the distorted frame and the current frame are misaligned, which is an important cue for detecting them.
We can also transfer distortion patterns into a low-level feature space. By propagating a low-level feature and extracting one on the current frame, the two features can likewise be used for distortion map prediction. Because this brings extra computation cost, we choose to conduct distortion prediction in image space, and we provide ablation studies on the other designs in the experimental section.
Following the design of siamese networks, we propose DMNet for distortion map prediction by comparing two frame images. Figure 6 shows the architecture of DMNet. Identical feature extraction is conducted on the two frames. For efficient computation, the feature extractor consists only of a few blocks of separable convolution, batch normalization, and ReLU, so its computation cost is nearly negligible. After feature extraction, DMNet computes the similarity $s$ between the two resulting features. We denote by $f^{c}$ and $f^{d}$ the features coming from the current frame and the distorted frame, respectively:
$$s(i,j) = \left\langle \bar{f}^{c}(i,j),\ \bar{f}^{d}(i,j) \right\rangle,$$
where $i$ and $j$ denote the spatial position, $\bar{f}$ denotes the $l_2$-normalized vector, and $\langle\cdot,\cdot\rangle$ measures the cosine similarity between two normalized vectors. Distorted areas lie in misaligned areas, which show lower values on the similarity map. Since $s$ lies in the range $[-1, 1]$, normalization is necessary before outputting the distortion map $d$:
$$d = \frac{1 - s}{2}.$$
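As a small worked example of this computation, the sketch below derives a distortion map from two feature grids. It assumes that the normalization is the linear map $(1 - s)/2$, which sends a similarity of $1$ to no distortion and $-1$ to full distortion; the function names and the nested-list layout (height x width x channels) are illustrative choices and not part of DMNet itself.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / (norm + eps) for v in vector]


def distortion_map(feat_current, feat_distorted):
    """Per-position distortion in [0, 1] from two H x W x C feature grids."""
    result = []
    for row_c, row_d in zip(feat_current, feat_distorted):
        out_row = []
        for vec_c, vec_d in zip(row_c, row_d):
            a, b = l2_normalize(vec_c), l2_normalize(vec_d)
            s = sum(x * y for x, y in zip(a, b))  # cosine similarity in [-1, 1]
            out_row.append((1.0 - s) / 2.0)       # assumed linear normalization
        result.append(out_row)
    return result
```

Aligned vectors give a value near 0, orthogonal vectors give 0.5, and opposite vectors give a value near 1.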
In the training procedure of DMNet, we generate ground truth for supervised learning, as shown in Figure 7. Each training sample can be denoted by a 2-tuple (Frame$_{t}$, Frame$_{t+k}$), where $k$ represents the time interval. First, we extract the semantic features $F_{t}$ and $F_{t+k}$ of the two frames using a segmentation model. Then we propagate $F_{t}$ and Frame$_{t}$ under the guidance of the predicted optical flow and obtain the propagated feature $\hat{F}_{t+k}$ and the distorted frame $\widehat{\text{Frame}}_{t+k}$. The corresponding segmentation results are obtained by an argmax operation on $\hat{F}_{t+k}$ and $F_{t+k}$. $\widehat{\text{Frame}}_{t+k}$ and Frame$_{t+k}$ are passed into DMNet for distortion map prediction, while the label is generated by an XOR operation on the two segmentation results and serves as the supervision signal for the distortion map predicted by DMNet.
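The label generation step can be illustrated with the following sketch, which treats the XOR on two multi-class segmentation results as a per-pixel disagreement test. The helper names and the nested-list layout (height x width x classes) are assumptions made for the example.

```python
def argmax_labels(score_map):
    """Turn an H x W grid of per-class scores into an H x W grid of class ids."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in row]
        for row in score_map
    ]


def distortion_label(propagated_scores, reference_scores):
    """Return 1 where the two predicted classes differ, 0 where they agree."""
    propagated = argmax_labels(propagated_scores)
    reference = argmax_labels(reference_scores)
    return [
        [int(p != r) for p, r in zip(p_row, r_row)]
        for p_row, r_row in zip(propagated, reference)
    ]
```

The resulting binary map marks exactly the pixels where propagation changed the predicted class, so no manual annotation is needed to train the distortion predictor.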
III-C Feature Correction Module
In this section, we propose the feature correction module (FCM) to correct distorted features effectively but not excessively. For such a task, FCM needs to tackle two problems: 1) how to extract correction cues from the current frame, and 2) how to correct distorted features. Towards this goal, we make FCM distortion-aware by introducing predicted distortion maps.
To achieve effective correction cue extraction from the current frame, we propose a light-weight CFNet and design a specific learning strategy. CFNet consists of 6 blocks of convolution, batch normalization, and ReLU for feature encoding and 2 blocks of deconvolution and LReLU for feature decoding. The efficiency of CFNet is experimentally verified: it costs only about 1/6 of the computation of DeepLabv3+. To force CFNet to focus on easily distorted areas, we propose a distortion-guided feature learning strategy, as shown in Figure 8. After extracting features, the cross-entropy loss is calculated based on the provided label and reweighted by the predicted distortion map, denoted as $L_{\text{dgfl}}$. With this training strategy, CFNet only needs to focus on easily distorted areas, no matter how poorly it performs on other "easy" areas, which effectively reduces the learning difficulty.
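A distortion-reweighted cross-entropy of this kind can be sketched as below. The normalization by the sum of weights is our own assumption for the example, as are the function name and the nested-list layout (logits: height x width x classes).

```python
import math


def weighted_cross_entropy(logits, labels, distortion, eps=1e-12):
    """Cross-entropy averaged with per-pixel weights taken from the distortion map."""
    total, weight_sum = 0.0, 0.0
    for logit_row, label_row, weight_row in zip(logits, labels, distortion):
        for scores, label, weight in zip(logit_row, label_row, weight_row):
            peak = max(scores)
            # Numerically stable log-sum-exp for the softmax normaliser.
            log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
            total += weight * (log_z - scores[label])
            weight_sum += weight
    return total / (weight_sum + eps)
```

Pixels with a weight of zero contribute nothing to the loss, so errors in areas that are not distorted do not penalize the cue extractor.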
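To prevent excessive correction, we introduce the predicted distortion map into the feature correction procedure, as shown in Figure 8. The propagated feature $\hat{F}$ is rectified by the extracted correction cue $F^{\text{cue}}$ under the guidance of the predicted distortion map $d$:
$$\tilde{F} = d \odot F^{\text{cue}} + (1 - d) \odot \hat{F},$$
where $\odot$ denotes element-wise multiplication and $\tilde{F}$ denotes the corrected feature.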
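As an illustration, distortion-guided correction can be written as a per-pixel convex blend between the propagated feature and the correction cue. The blend form is an assumption consistent with the description above (correct distorted areas, reuse the rest), and `correct_features` is a hypothetical name.

```python
def correct_features(propagated, cue, distortion):
    """Blend H x W x C features: distortion 1 takes the cue, 0 keeps the propagated value."""
    return [
        [
            [d * c + (1.0 - d) * p for p, c in zip(p_vec, c_vec)]
            for p_vec, c_vec, d in zip(p_row, c_row, d_row)
        ]
        for p_row, c_row, d_row in zip(propagated, cue, distortion)
    ]
```

Setting the distortion value to 0.5 everywhere reduces the blend to a plain average of the two features, which ignores where the distortion actually is.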
III-D Training Strategy
The training strategy of our proposed framework is illustrated in Figure 9. Here we first briefly introduce, by way of an example, the training procedure [zhu2017deep] widely used in previous works. For video semantic segmentation, one training sample can be denoted by a 3-tuple (Frame$_{t}$, Frame$_{t+k}$, GT$_{t+k}$), where Frame$_{t}$ and Frame$_{t+k}$ are the key frame and the current frame, respectively, and GT$_{t+k}$ is the segmentation ground truth of Frame$_{t+k}$. During training, Frame$_{t}$ is fed into the image segmentation model to extract the features, and meanwhile the optical flow between Frame$_{t}$ and Frame$_{t+k}$ is estimated with FlowNet. Then the extracted features are propagated to Frame$_{t+k}$, and the cross-entropy loss at Frame$_{t+k}$ is calculated to train the network. In practice, Frame$_{t}$ is randomly selected from a 10-frame video clip and Frame$_{t+k}$ is always the last one, which enriches the diversity of training samples. All involved components are trained jointly except for Net.
However, the above training procedure may be unstable due to the inaccuracy of optical flow estimation, especially for long-distance propagation (e.g., more than 5 frames). In this work, we propose dual deep supervision (DDS) to provide extra supervision for better network training. Specifically, we add an intermediate frame to each training sample, denoted by Frame$_{t+i}$ with $0 < i < k$, which reduces the propagation distance by imposing a supervision signal on Frame$_{t+i}$. Consequently, the optical flow is easier to predict due to the smaller scene motion, and network training becomes more stable due to the additional supervision signals. Besides, we argue that two warp operations in one training iteration are more appropriate than a single one, since at inference features are propagated from the previous frame rather than always from the key frame.
In our experiments, Frame$_{t+i}$ is randomly selected to ensure the diversity of training samples. To be specific, we extract the features of Frame$_{t}$, and then conduct feature propagation twice (Frame$_{t}$ $\rightarrow$ Frame$_{t+i}$ $\rightarrow$ Frame$_{t+k}$). Then we produce the pseudo label of Frame$_{t+i}$ using the image segmentation model. Pseudo labeling is a natural and popular way to improve segmentation quality in domain adaptation [zou2018unsupervised, hungadversarial]. Finally, we apply the constructed supervision signals to the feature propagation and correction procedures, respectively, as shown in Figure 9. In particular, the propagation supervision ($L_{\text{prop}}$) works on the warped features to improve the quality of the optical flow, and the correction supervision ($L_{\text{corr}}$) works on the corrected features to enhance the ability of feature correction. Together with the proposed distortion-guided feature learning loss $L_{\text{dgfl}}$ in FCM, our final loss combines $L_{\text{prop}}$, $L_{\text{corr}}$, and $L_{\text{dgfl}}$.
In this section, we experimentally evaluate our proposed method on two challenging datasets, namely Cityscapes [cordts2016cityscapes] and CamVid [brostow2009semantic], and compare it with several state-of-the-art methods. All experiments are conducted on NVIDIA GTX 1080Ti GPUs.
Cityscapes [cordts2016cityscapes] is a representative dataset for semantic segmentation and autonomous driving. It focuses on semantic understanding of urban street scenes. The training and validation subsets contain 2,975 and 500 video clips, respectively, and each video clip contains 30 frames. The 20th frame in each clip is annotated with pixel-level semantic labels of 19 categories.
CamVid [brostow2009semantic] similarly focuses on the semantic understanding of urban street scenes, but it contains less data than Cityscapes. It has only 701 color images with annotations of 11 semantic classes. CamVid is divided into a trainval set with 468 samples and a test set with 233 samples. All samples are extracted from driving videos captured at daytime and dusk, and have pixel-level semantic annotations. Each CamVid video contains 3,600 to 11,000 frames at a resolution of 720 × 960.
IV-B Evaluation Metrics
We experimentally evaluate the video semantic segmentation methods by measuring the segmentation accuracy and computational efficiency.
For segmentation accuracy, we propose to use the propagation distance vs. accuracy curve (PDA Curve), which indicates how the segmentation accuracy changes with the propagation distance. Some previous works [zhu2017deep, jain2019accel] report the average accuracy over different propagation distances, which makes it hard to see the actual performance at a given distance. For computational efficiency, we propose to use the computation cost vs. accuracy curve (CCA Curve). The CCA Curve is an important metric for model deployment, as it indicates how the segmentation accuracy changes with the average computation cost.
In the experiments on Cityscapes, we set the to frames as the candidates of key frame and propagate it to the annotated frame, which is used to measure the segmentation accuracy for each video clip. That is, the propagation distance (denoted as ) ranges from to , which is used for plotting PDA Curve. When plotting CCA Curve, we first calculate the computation cost of components used on the key frames (i.e., Net and Net) and non-key frames (i.e., FlowNet, FCM, and Net), which are denoted by C and C, respectively. The average computation cost is calculated by
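As a rough illustration of how such an average behaves, the sketch below assumes that one key frame is followed by a fixed number of non-key frames equal to the propagation distance, so that the key-frame cost is amortized over the whole group. Both this scheduling assumption and the function name `average_cost` are ours.

```python
def average_cost(cost_key, cost_nonkey, distance):
    """Average per-frame cost for one key frame followed by `distance` non-key frames."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return (cost_key + distance * cost_nonkey) / (distance + 1)
```

Under this assumption a distance of 0 recovers the per-frame cost of the key-frame pipeline, and the average approaches the non-key cost as the distance grows.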
Evaluations on CamVid are similar to those on Cityscapes. Here mean intersection over union (mIoU) is adopted to measure the segmentation accuracy, and the number of floating-point operations (FLOPs) is used to measure the computation cost.
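For reference, mIoU can be computed from predicted and ground-truth label maps as in the sketch below. The ignore label of 255 and the choice to skip classes that never occur are assumptions of this example, and in practice the counts are accumulated over the whole evaluation set rather than a single image.

```python
def mean_iou(prediction, target, num_classes, ignore_label=255):
    """Mean intersection over union across classes that appear in prediction or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(prediction, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_label:
                continue  # unlabeled pixels do not count
            if p == t:
                intersection[t] += 1
                union[t] += 1
            else:
                union[t] += 1  # missed pixel of class t
                union[p] += 1  # false alarm for class p
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")
```

In the 2 x 2 case with prediction `[[0, 0], [1, 1]]` and target `[[0, 1], [1, 1]]`, class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, giving an mIoU of 7/12.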
IV-C Performance Evaluation
We compare our proposed FCM with recent state-of-the-art methods, including DFF [zhu2017deep], DVSNet [xu2018dynamic], and Accel [jain2019accel]. Figure 10 shows the results on the Cityscapes val subset as a CCA Curve. The baseline methods only provide models trained on Cityscapes, and thus we only report results on Cityscapes here for a fair comparison (the results on CamVid are presented in the ablation study). For DFF and DVSNet, the same network, DeeplabFast, is used as the segmentation backbone. However, DVSNet splits the input frames into four overlapping regions and performs segmentation multiple times, and is thus more time-consuming. For Accel, Deeplab with Deformable ResNet-101 is used as the image segmentation backbone, and multiple versions of ResNet with different depths are adopted to process the current frame. From Figure 10, it can be seen that our proposed method significantly outperforms the other methods in both accuracy and efficiency.
IV-D Ablation Study
IV-D1 Effectiveness of FCM
Here we verify the effectiveness of FCM on Cityscapes and CamVid, as shown in Figure 11 and Figure 12. For a fair comparison, we reimplement the baseline methods using DeepLabv3+ as the segmentation backbone network and the same FlowNet as in our proposed method. In particular, our implemented DeepLabv3+ achieves an mIoU score of on Cityscapes and on CamVid for image segmentation. From the results, it can be seen that our proposed method significantly outperforms the other state-of-the-art methods, especially for long-distance feature propagation.
Besides, we calculate the average computation cost by fixing the propagation distance as for all methods. The results are shown in Figure 11 and Figure 12 with color bars, in which a lighter color represents a higher computation cost. Note that the computation cost of Accel34 is higher than that of Accel50 because an extra deconvolutional layer is involved in Accel34 for feature upsampling. We also analyze the computation cost of each component in the proposed framework, and the statistics are provided in Table I. It can be seen that the image segmentation network dominates the computation cost. As shown in Figure 11 and Figure 12, our proposed FCM has a slightly higher computation cost than DFF and DVSNet, but gains a significant accuracy improvement. Our method also retains its superiority on a small dataset with fewer training samples (e.g., CamVid).
It is notable that our method can yield higher segmentation accuracy than per-frame image segmentation for short-distance feature propagation. This is because feature propagation can exploit information from multiple frames and can therefore potentially achieve better results than per-frame segmentation. In the baseline methods, accuracy drops because of the feature distortion that arises during propagation. Our method greatly alleviates this phenomenon and boosts accuracy.
IV-D2 Effectiveness of Distortion Map
In this work, we propose to introduce the distortion map into feature correction so that correction focuses only on distorted areas, preventing the excessive correction seen in Accel. Here we explore the upper bound of Accel and of our proposed FCM by correcting only the wrongly predicted areas in the propagated features, as shown in Figure 13. Significant gaps exist between the different variants of Accel and their corresponding upper bounds, which indicates that global feature correction suffers from misrectification and deteriorates the propagated features. In contrast, our proposed FCM shows only a small gap, which experimentally demonstrates the effectiveness of distortion-guided feature correction.
Besides, we also explore the effect of distortion-guided feature learning and correction in our FCM, as shown in Figure 14. In the "w/o distortion-guided feature learning (DGFL)" setting, CFNet is trained with the original cross-entropy loss. In the "w/o distortion-guided feature correction (DGFC)" setting, propagated features are merged with correction cues by an average operation. The "w/o distortion guidance (DG)" setting is the combination of "w/o DGFL" and "w/o DGFC". The results show that DGFL improves long-distance propagation by optimizing the training of CFNet and obtaining better correction cues, while DGFC improves short-distance propagation by reusing accurate propagated features as much as possible.
IV-D3 Design of Distortion Transfer Strategy
We propose to transfer distortion patterns from propagated features into image space and use DMNet for semantic comparison between the distorted frame and the current frame. Here we explore other designs of the distortion transfer strategy. From the perspective of the segmentation model, the propagated features are at the highest level (after the classifier) and the input images can be regarded as the lowest-level features. We can therefore also transfer distortion patterns into features at an intermediate level and predict the distortion map with extra feature extraction on the current frame. As shown in Figure 15, we provide predicted distortion maps from features at different levels. With higher-level features, the resulting distortion maps are cleaner and sharper.
Besides, in order to better explore the effect of distortion maps, we compare FCM trained with different distortion maps, as shown in Figure 16. Here, the "high-level feat" setting means that we use propagated features for distortion map prediction. The "Ground Truth" setting means that we directly feed the ground truth of the distortion maps into FCM training instead of the predictions. With high-quality distortion maps, FCM achieves significant improvement. However, using higher-level features for distortion map prediction requires more computation on the current frame for feature extraction. There is thus a trade-off between propagation accuracy and computation cost. For computational efficiency, we choose to conduct distortion map prediction in image space.
IV-D4 Effectiveness of DDS
Here we explore the effect of dual deep supervision (DDS), as shown in Table II. Specifically, we apply DDS to both DFF and FCM. With DFF, DDS only contains S. The experimental results show that DDS effectively improves segmentation accuracy.
| DFF w/ DDS | 76.96 | 72.86 | 67.81 | 72.51 |
| FCM w/o DDS | 77.30 | 74.30 | 71.37 | 74.26 |
We present a novel video semantic segmentation framework in this paper, aiming at achieving high segmentation accuracy and competitive run-time performance simultaneously by tackling feature distortion problem in propagation. Specifically, we propose a distortion pattern transfer strategy for detecting distortion areas in propagated features. Then we propose feature correction module (FCM) to rectify the distorted features locally, guided by the predicted distortion maps. Our experimental results on both Cityscapes and CamVid show that the proposed method outperforms the state-of-the-art methods in both precision and speed.