
How to Train Your Dragon: Tamed Warping Network for Semantic Video Segmentation

Real-time semantic segmentation on high-resolution videos is challenging due to the strict requirements of speed. Recent approaches have utilized the inter-frame continuity to reduce redundant computation by warping the feature maps across adjacent frames, greatly speeding up the inference phase. However, their accuracy drops significantly owing to imprecise motion estimation and error accumulation. In this paper, we propose to introduce a simple and effective correction stage right after the warping stage to form a framework named Tamed Warping Network (TWNet), aiming to improve the accuracy and robustness of warping-based models. The experimental results on the Cityscapes dataset show that with the correction, the accuracy (mIoU) significantly increases from 67.3% to 71.6% at a speed of 61.8 FPS. For non-rigid categories such as "human" and "object", the improvements of IoU are even higher than 18 percentage points.



1 Introduction

Semantic video segmentation aims at generating a sequence of pixel-wise class label maps for consecutive frames in a video. A real-time solution to this task is challenging due to the stringent requirements of speed and space. Prevailing real-time methods can be grouped into two major categories: per-frame methods and warping-based methods. Per-frame methods decompose the video task into a stream of mutually independent image segmentation tasks. They usually reduce the resolution of input images [2, 14, 29] or adopt a lightweight CNN [8, 14, 15, 19, 23, 29, 32, 33] to meet the real-time demand. In light of the visual continuity between adjacent video frames, warping-based approaches [7, 11, 16, 28, 36] utilize some inter-frame motion estimation (e.g., optical flows [28, 36] or motion vectors [11]) to avoid redundant computations, propagating (or warping) the segmentation results from the previous frame to the current one.

Although existing warping-based methods significantly boost the inference speed by saving the computational time for a large number of redundant frames, their performance drops considerably due to the following limitations. 1) Every pixel in a non-key frame is estimated by motion flows from some pixels in the previous frame. Such estimation may work for continuous regions in a video, but a moving non-rigid object such as a human usually becomes severely deformed, as shown in Fig. 1(a), since a non-rigid object in a new frame can contain a considerable number of new pixels that cannot be estimated from the previous frame. 2) Errors accumulate along consecutive non-key frames, making the results of later frames almost unusable, as shown in Fig. 1(b). All in all, warping behaves like a runaway fierce creature. The key to the issue is to tame it: to take advantage of its acceleration while keeping it under control.

Figure 1: Qualitative results. The warped results are shown in the second row. (a) Non-rigid moving objects suffer deformation. (b) Warping results in error accumulation along consecutive non-key frames. By adding the lightweight correction stage, these two kinds of problems are significantly alleviated as shown in the third row.

To deal with the issues above, we propose to introduce a lightweight correction stage after the warping stage, resulting in a novel framework, Tamed Warping Network (TWNet), which contains a novel architecture and two correction modules for every non-key frame. First of all, we propose the non-key-frame CNN (NKFC) to perform segmentation for non-key frames. As shown in Fig. 2, NKFC fuses warped deep features from the previous frame with the features from its own shallow layers, making itself fast and able to retain spatial details, the requisites for the following correction. To alleviate the issue of deformation, we introduce a correction stage consisting of two modules. We design the context feature rectification (CFR) module to correct the warped features with the help of the spatial details contained in NKFC. To help CFR focus on error-prone regions, we design the residual-guided attention (RGA) module to utilize the residual maps in the compressed domain. It is worth noting that TWNet alleviates error accumulation in that it corrects every non-key frame rather than indulging the warping without restriction.

Figure 2: (a) Commonly-used warping-based approaches. (b) Our TWNet. Black arrows denote the main path of the network, while grey arrows denote the lateral connections [17]. The warped layers are indicated by red color. The dotted arrows and boxes denote skipped operations.

We evaluate TWNet on the Cityscapes [6] and CamVid [3] datasets. Experimental results demonstrate that TWNet greatly improves the accuracy and robustness of warping-based models. Specifically, for the videos on Cityscapes, the accuracy (mIoU) significantly increases from 67.3% to 71.6% by adding the correction, and the speed edges down from 65.5 FPS to 61.8 FPS. Furthermore, we find the performance improvements of non-rigid categories such as "human" and "object" are even higher than 18 percentage points.

The main contributions of this work are summarized as follows:

  • We propose a novel framework, Tamed Warping Network (TWNet), to improve the performance of warping-based segmentation models by introducing a lightweight correction stage.

  • We propose two efficient correction modules, the Context Feature Rectification module as well as the Residual-Guided Attention module, to alleviate the non-rigid object deformation and error accumulation problems during feature warping.

  • Experimental results on Cityscapes and CamVid demonstrate that the proposed TWNet framework greatly improves the robustness and accuracy.

2 Related Work

Feature fusion in semantic segmentation. Semantic segmentation in an FCN manner [18] has achieved remarkable accuracy. Recently, high-quality models [5, 9, 31, 34] as well as high-speed lightweight approaches [14, 19, 29, 33] show the importance of feature fusion from different layers (or scales). Generally, shallow layers contain more low-level spatial details, while deep layers contain more contextual information. The combination of features from different layers significantly improves the accuracy.

Feature fusion in our proposed non-key-frame CNN is similar to those in FPN [17] and U-Net [21], where lateral connections are used to fuse the low-level (spatial) and high-level (context) features. In comparison, our non-key-frame CNN only retains a small number of layers of the encoder to extract low-level features and obtains high-level features by interior feature warping. Thus, the non-key-frame CNN saves the heavy computation of context feature extraction.

Warping-based video segmentation. In general, videos are of temporal continuity, i.e., consecutive video frames look similar, making it possible to perform inter-frame prediction. The process of mapping pixels from the previous frame to the current one according to a pixel-level motion map is called image warping (Fig. 3). Each point in the motion map is a two-dimensional vector, $(\delta x, \delta y)$, representing the movement from the previous frame to the current one. Motion vectors and optical flows are two kinds of commonly used motion maps. In general, motion vectors, already contained in videos, are less precise than optical flows (e.g., TV-L1 [30], FlowNet2.0 [10] and PWC-Net [26]). However, it takes extra time to perform optical flow estimation.
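As an illustration, backward warping with a per-pixel motion map can be sketched in a few lines of NumPy. This is a simplified nearest-neighbor version for intuition only; flow-based methods typically use sub-pixel (bilinear) sampling.

```python
import numpy as np

def warp(prev, motion):
    """Backward-warp `prev` (H, W, C) to the current frame.

    `motion` (H, W, 2) holds per-pixel (dy, dx) offsets: each output
    pixel at location p is fetched from `prev` at p - motion(p),
    clipped to the image border. Nearest-neighbor sampling for brevity.
    """
    H, W = prev.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.rint(ys - motion[..., 0]), 0, H - 1).astype(int)
    src_x = np.clip(np.rint(xs - motion[..., 1]), 0, W - 1).astype(int)
    return prev[src_y, src_x]
```

For example, a motion map of constant (0, 1) shifts the image one pixel to the right, with border pixels clamped.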

Researchers have proposed many warping-based semantic video segmentation approaches [7, 11, 28, 36]. Gadde et al. [7] proposed to enhance the features of the current frame by combining them with the warped features from previous frames. Zhu et al. [36], Xu et al. [28] and Jain et al. [11] proposed to use feature warping for acceleration. They divided frames into two types, key frames and non-key frames. Key frames are sent to the CNN for segmentation, while non-key frame results are obtained by warping. These warping-based approaches efficiently speed up the inference phase since the computational cost of warping is much less than that of a CNN. However, both the accuracy and the robustness of these methods deteriorate due to the following reasons. First, neither optical flows nor motion vectors can estimate the precise motion of all pixels. Thus, there always exist unavoidable biases (errors) between the warped features and the expected ones. Second, in the case of consecutive non-key frames, the errors accumulate fast, leading to unusable results. To address the error accumulation problem, Li et al. [16] and Xu et al. [28] proposed to adaptively select key frames by predicting a confidence score for each frame. Jain et al. [11] introduced bi-directional feature warping to improve the accuracy. However, all these approaches lack the ability to correct warped features.

3 Preliminaries: Warping-Correction in Video Codecs

Figure 3: Visualization of Warping and Correction. (a) In image space, video codecs first warp the previous frame to the current one, then add the image-space residual map for correction. (b) We propose to learn the residual in feature space to rectify the warped features. Note that for better visualization, we use segmentation labels to represent features.

In this section, we describe the basics and the pipeline of warping-correction in video codecs. Warping is an efficient operation for inter-frame prediction. Given the previous frame $I_{t-1}$ and the motion vectors $\mathcal{M}_t$ of the current frame, we can estimate the current frame $\hat{I}_t$ by image warping. The process is described as

$\hat{I}_t = \mathcal{W}(I_{t-1}, \mathcal{M}_t)$   (1)
For a particular location index $p$ in an image, the mapping function is given by

$\hat{I}_t(p) = I_{t-1}\big(p - \mathcal{M}_t(p)\big)$   (2)
However, there always exist biases between the warped image and the real one, especially in complicated scenes and for non-rigid moving objects. In order to correct the biases in image space, modern video codecs (e.g., MPEG [13], H.264 [25]) add a correction step after image warping (Fig. 3(a)). Specifically, the codec performs pixel-wise addition between the warped image and the residual map $\mathcal{R}_t$ to make the correction. Each point in $\mathcal{R}_t$ is a three-dimensional vector, $(\delta r, \delta g, \delta b)$, which describes the color differences between the warped pixel and the expected one. The overall inter-frame prediction process is described as

$I_t = \mathcal{W}(I_{t-1}, \mathcal{M}_t) + \mathcal{R}_t$   (3)
The correction step effectively addresses the bias problem and the error accumulation problem in image space. Inspired by this, we propose to learn the residual term in feature space (Fig. 3(b)) to alleviate problems incurred by feature warping.
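The codec pipeline can be mimicked end to end in a few lines: the encoder stores the residual between the true frame and its warped prediction, and the decoder adds the residual back, recovering the frame exactly. The sketch below uses hypothetical random 4x4 "frames" and nearest-neighbor warping purely for illustration.

```python
import numpy as np

def warp(prev, motion):
    # Backward warp with nearest-neighbor sampling.
    H, W = prev.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    sy = np.clip(np.rint(ys - motion[..., 0]), 0, H - 1).astype(int)
    sx = np.clip(np.rint(xs - motion[..., 1]), 0, W - 1).astype(int)
    return prev[sy, sx]

rng = np.random.default_rng(0)
prev_frame = rng.random((4, 4, 3))                  # previous frame
cur_frame = rng.random((4, 4, 3))                   # current frame (ground truth)
mv = rng.integers(-1, 2, (4, 4, 2)).astype(float)   # motion vectors

# Encoder side: residual = true frame minus warped prediction.
residual = cur_frame - warp(prev_frame, mv)
# Decoder side: warped prediction plus residual recovers the frame.
decoded = warp(prev_frame, mv) + residual
```

Whatever the motion vectors miss ends up in the residual, which is exactly why the residual map highlights error-prone regions.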

4 Tamed Warping Network

In this section, we introduce the Tamed Warping Network (TWNet). We first outline the overall framework. Next, we introduce our non-key-frame CNN architecture and the correction modules, i.e., the Context Feature Rectification (CFR) module as well as the Residual-Guided Attention (RGA) module. Finally, we present the implementation details.

4.1 Overview

Figure 4: The framework of Tamed Warping Network. The key-frames (I-frames) are sent to the key-frame CNN. The non-key frames (P-frames) are sent to the non-key-frame CNN, where the warped context features are first corrected by the CFR and RGA modules and then fused with the spatial features extracted from the current frame. Both branches output the result label maps and the interior context feature maps.

The whole framework of Tamed Warping Network (TWNet) is illustrated in Fig. 4. In our framework, video frames are divided into key frames (I-frames) and non-key frames (P-frames). The key frames are sent to the key-frame CNN (a U-shaped [17, 21] FCN) where the context features of a selected interior layer are sent (warped) to the next frame. For non-key frames, the warped features are first corrected by the Context Feature Rectification (CFR) and Residual-Guided Attention (RGA) modules and then fused with the spatial features of the current frame. Then, the resultant features are used to make the prediction and are also sent to the next frame. The components of TWNet are described as follows.
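At inference time the framework reduces to a simple dispatch loop over the frame types. The sketch below uses hypothetical stand-in callables for the two CNN branches; each returns a label map and the context features to carry forward.

```python
def segment_video(frames, frame_types, key_cnn, non_key_cnn):
    """Run TWNet-style inference over a decoded video.

    `frame_types[i]` is "I" (key) or "P" (non-key). Both branches
    return (label_map, context_features); the context features are
    carried forward so the next P-frame can warp and correct them.
    """
    context = None
    labels = []
    for frame, ftype in zip(frames, frame_types):
        if ftype == "I" or context is None:
            label, context = key_cnn(frame)
        else:
            # warp previous context, correct with CFR/RGA, fuse
            label, context = non_key_cnn(frame, context)
        labels.append(label)
    return labels
```

The `context is None` guard simply ensures a video that starts mid-GOP still gets a full forward pass on its first frame.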

4.2 Basic Model: The Non-Key-Frame CNN

Warping can be applied in feature space (Fig. 3(b)) since the "conv" layers in FCN models preserve the position information. Given the features $F_{t-1}$ of the previous frame, we can use the motion map $\mathcal{M}_t$ to predict the features of the current frame by

$\hat{F}_t = \mathcal{W}(F_{t-1}, \mathcal{M}_t)$   (4)
In this work, we propose the non-key-frame CNN architecture as the basic model to perform segmentation for non-key frames, considering both efficiency and accuracy. For one thing, in order to speed up, it applies interior warping to obtain deep context features. For another, different from previous warping methods, the non-key-frame CNN preserves several shallow head layers to add spatial features. Then, the two kinds of features are fused by lateral connections as in FPN [17] and U-Net [21]. Meanwhile, it is worth noting that the spatial features are also provided to the correction modules as the input.
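The fusion inside the non-key-frame CNN can be pictured as a single lateral connection: the warped context features are upsampled to the spatial-feature resolution and concatenated. The channel counts and the integer upsampling factor below are illustrative, not the paper's exact configuration.

```python
import numpy as np

def lateral_fuse(spatial, context):
    """Nearest-neighbor upsample `context` (h, w, Cc) to the resolution
    (H, W) of `spatial` (H, W, Cs), then concatenate along the channel
    axis, in the spirit of FPN/U-Net lateral connections.

    Assumes H and W are integer multiples of h and w.
    """
    H, W = spatial.shape[:2]
    h, w = context.shape[:2]
    up = context[np.repeat(np.arange(h), H // h)]       # repeat rows
    up = up[:, np.repeat(np.arange(w), W // w)]         # repeat columns
    return np.concatenate([spatial, up], axis=-1)
```

A real implementation would use a learned upsampling or interpolation plus a convolution, but the shape bookkeeping is the same.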

4.3 The Correction Stage

Inspired by the warping-correction pipeline in video codecs (Sec. 3), we propose to add a correction step (right part of Fig. 3(b)) in our segmentation pipeline. The correction stage consists of two key modules. The Context Feature Rectification (CFR) module corrects the warped features, and the Residual-Guided Attention (RGA) module guides the learning of CFR.

Context Feature Rectification. Although the non-key-frame CNN can add spatial details by employing low-level features via lateral connections, the errors in the context features would still accumulate during successive feature warping. Thus, we claim that the warped context features themselves need to be corrected. Comparing the pipeline of video codecs with that of warping-based segmentation methods (Fig. 3), we find that the main problem of previous methods is the lack of a correction stage, which leads to inaccurate predictions and the error accumulation issue.

Inspired by the correction step in video codecs, we introduce the lightweight Context Feature Rectification (CFR) module to explicitly correct the warped context features. We design the CFR module considering the following aspects. First, the contextual information of the warped features is generally correct, except at the edges of moving objects. Second, the low-level features contain spatial information such as "edge" and "shape". Thus, we claim that the shape information in the low-level features can help to correct the context features, and we apply feature aggregation to perform this correction. CFR takes the warped context features $\hat{F}_t$ as well as the spatial features $F^s_t$ of the current frame as the inputs and outputs the corrected context features $\tilde{F}_t$, as shown in Fig. 5(a). Specifically, the CFR module adopts a single-layer network, $g$, which takes the concatenation $[\hat{F}_t, F^s_t]$ as the input and outputs the residual in feature space, $R^F_t$. The mathematical form of CFR is defined as

$\tilde{F}_t = \hat{F}_t + R^F_t, \qquad R^F_t = g\big([\hat{F}_t, F^s_t]\big)$   (5)
Figure 5: Details of (a) CFR and (b) CFR+RGA. "⊗": element-wise multiplication; "⊕": element-wise addition.

During the training of CFR, in addition to the commonly used softmax cross entropy loss for pixel-level classification, we employ an L2 consistency loss to minimize the distance between the corrected context features and the features extracted by the per-frame CNN.
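The CFR computation itself is tiny: a single layer maps the concatenated features to a feature-space residual that is added back to the warped context. A NumPy sketch with a 1x1 convolution stands in for the single-layer network $g$; the weight shapes (and the absence of an activation) are our illustrative choices.

```python
import numpy as np

def cfr(warped_ctx, spatial, weight, bias):
    """Context Feature Rectification as a 1x1 convolution.

    warped_ctx: (H, W, Cc) warped context features
    spatial:    (H, W, Cs) spatial features of the current frame
    weight:     (Cc + Cs, Cc), bias: (Cc,) -- the single-layer net g
    Returns the corrected context features: warped_ctx + residual.
    Activation is omitted for brevity; the residual stays signed.
    """
    x = np.concatenate([warped_ctx, spatial], axis=-1)   # (H, W, Cc + Cs)
    residual = x @ weight + bias                         # (H, W, Cc)
    return warped_ctx + residual
```

Note the identity-friendly form: with zero weights and bias the module is a no-op, so an untrained CFR cannot make the warped features worse than plain warping.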

Residual-Guided Attention. To guide the learning of CFR and further improve its performance, we propose the Residual-Guided Attention (RGA) module.

In our TWNet, the motion vectors used for feature warping are the same as those used in image warping. Therefore, the biases/errors should appear in the same spatial regions in image space and feature space. Thus, the residual maps $\mathcal{R}_t$ in image space can be used as prior information to guide the learning of the residuals $R^F_t$ in feature space. To realize this guidance, we adopt a lightweight spatial attention module named RGA to let the CFR module focus more on the regions with higher responses in $\mathcal{R}_t$. Specifically, we first resize the residual map $\mathcal{R}_t$ to the shape of the warped context features, obtaining $\mathcal{R}'_t$. Then, we calculate the spatial attention map $A_t$ using a single-layer feed-forward CNN $f$ as follows

$A_t = f(\mathcal{R}'_t)$   (6)
Finally, we apply spatial attention by performing element-wise multiplication between $A_t$ and $R^F_t$. Fig. 5(b) illustrates the detailed structure of RGA. After applying this module, we formulate the whole process of TWNet for non-key frames, defined as

$\tilde{F}_t = \hat{F}_t + A_t \odot g\big([\hat{F}_t, F^s_t]\big)$   (7)
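Putting CFR and RGA together: an attention map derived from the image-space residual gates the feature-space residual before it is added back. The sigmoid gating and all weight shapes below are our illustrative choices, not the paper's exact parameterization.

```python
import numpy as np

def rga_cfr(warped_ctx, spatial, res_map, w_g, w_a):
    """CFR residual gated by Residual-Guided Attention.

    warped_ctx: (H, W, Cc) warped context features
    spatial:    (H, W, Cs) spatial features of the current frame
    res_map:    (H, W, 3) image-space residual resized to feature shape
    w_g:        (Cc + Cs, Cc) weights of the CFR layer g
    w_a:        (3, 1) weights of the attention layer
    """
    feat_residual = np.concatenate([warped_ctx, spatial], axis=-1) @ w_g
    # per-pixel attention in (0, 1), broadcast over channels
    attention = 1.0 / (1.0 + np.exp(-(res_map @ w_a)))
    return warped_ctx + attention * feat_residual
```

Where the image-space residual is near zero (well-predicted regions), the gate suppresses the learned correction; where the codec had to store a large residual, the correction is let through.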
4.4 Implementation Details

Motion extraction and key frame selection. In our TWNet, we utilize motion vectors as motion maps. Both motion vectors and residuals are already contained in compressed videos. Thus, it takes no extra time to extract them. As for key frame scheduling, we simply regard all I-frames as key frames and P-frames as non-key frames, where I/P-frames are concepts in video compression. An I-frame (Intra-coded picture) is stored as a complete image, while a P-frame (Predicted picture) is stored as the corresponding motion vectors and residual. Following the previous works [24, 27], we choose MPEG-4 Part 2 (Simple Profile) [25] as the compression standard, where each group of pictures (GOP) contains an I-frame followed by 11 P-frames.
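With this fixed GOP structure, key-frame scheduling needs no learned policy; the frame type follows directly from the position in the GOP:

```python
GOP_SIZE = 12  # one I-frame followed by 11 P-frames per GOP

def frame_type(i):
    """Return "I" for key frames and "P" for non-key frames."""
    return "I" if i % GOP_SIZE == 0 else "P"
```

This is the opposite design choice to confidence-based scheduling [16, 28]: the schedule is decided by the encoder, for free, rather than by an extra predictor at inference time.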

Layer selection in the non-key-frame CNN. It is worth noting that the layer of the feature map used for warping can be arbitrarily chosen. In general, if we choose a deeper layer, there will be fewer corresponding head and tail layers (Fig. 6), since layers in the encoder-decoder architecture are paired by lateral connections. Thus, we will obtain a faster but less accurate model. Practically, we can adjust this hyperparameter to strike a balance between speed and accuracy.

Figure 6: Choices of different layers for interior warping. The chosen layer is indicated with red color. The dotted arrows and boxes denote skipped operations.

Training of TWNet. In general, the training of TWNet contains two main steps, i.e., the training of the per-frame CNN and the training of the non-key-frame CNN. The training of the per-frame CNN is similar to that of other image segmentation methods, where the softmax cross entropy loss $\mathcal{L}_{ce}$ and the regularization loss $\mathcal{L}_{reg}$ are used. Then, we fix all the parameters in the per-frame model (the key-frame CNN) and start to train the modules in the non-key-frame CNN. As mentioned in Sec. 4.3, we employ an additional L2 consistency loss $\mathcal{L}_{con}$ to minimize the distance between the corrected features and the context features extracted from the key-frame CNN. The objective function of this training step is defined as

$\mathcal{L} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_{reg} + \lambda_2 \mathcal{L}_{con}$   (8)

where $\lambda_1$ and $\lambda_2$ are weights to balance the different kinds of losses.
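The second training step thus combines three scalar losses. A minimal sketch, assuming the consistency term is a plain mean-squared L2 distance between the corrected features and the per-frame CNN's features:

```python
import numpy as np

def consistency_loss(corrected, reference):
    # L2 distance between the corrected context features
    # (non-key-frame CNN) and the per-frame CNN's context features
    return float(np.mean((corrected - reference) ** 2))

def total_loss(l_ce, l_reg, l_con, lam1, lam2):
    # cross-entropy + weighted regularization + weighted consistency
    return l_ce + lam1 * l_reg + lam2 * l_con
```

The consistency term vanishes exactly when the corrected features match the per-frame CNN's features, which is the training target for the correction stage.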

During inference, we leave out the consistency loss since only one path (i.e., either the key-frame CNN or non-key-frame CNN) is used.

5 Experiments

In this section, we present the experimental results of TWNet on high-resolution videos. First, we introduce the experimental environment. Then, we perform an ablation study to validate the effectiveness of the proposed non-key-frame CNN, CFR, and RGA. Finally, we compare TWNet with state-of-the-art semantic video segmentation methods.

5.1 Experimental Setup

Dataset. There are several commonly used datasets (e.g., Cityscapes [6], CamVid [3], COCO-Stuff [4], and ADE20K [35]) for semantic segmentation. We perform training and evaluation on the Cityscapes dataset [6] considering its high-resolution property and the requirement that the inputs should be video clips rather than images. Cityscapes contains 5000 high-resolution finely annotated images, which are divided into 2975, 500, and 1525 images for training, validation, and testing, respectively. Each image is a frame of a video clip. The dataset contains 19 classes for training and evaluation.

We perform ablation study on the validation set of Cityscapes and compare the results of TWNet with the state-of-the-art methods. Also, we perform experiments on CamVid to demonstrate that our framework is generic to different datasets.

Training. The training of TWNet is divided into two steps, i.e., the training of the per-frame CNN and the training of the non-key-frame CNN. To train the per-frame model, we use the 2975 finely annotated training images (i.e., the frames in the video clips). We use MobileNet [8] pre-trained on ImageNet [22] as the encoder of the key-frame CNN, and use three cascaded lateral connections [17] as the decoder. We apply the Adam optimizer [12], updating the pre-trained parameters with a smaller learning rate and applying weight decay. Training data augmentations include mean extraction, random scaling, random horizontal flipping, and random cropping. We implement the model using TensorFlow 1.12 [1] and perform training on a single GTX 1080 Ti GPU card.

After the training of the per-frame CNN, its parameters are fixed and we start to train the non-key-frame CNN. In each training step of the non-key-frame CNN, we first send a batch of frames into the per-frame CNN to extract their context features. Then, we perform warping and correction for the corresponding frames and calculate the loss according to Eq. 8. Note that if the consistency loss is employed, the frames should also be sent to the per-frame model to calculate the reference context features. In this training phase, random cropping is not adopted since the warping operation may exceed the cropped boundary; we train with batch size 4.

Evaluation. In the inference phase, we conduct all the experiments on full-resolution video clips. During evaluation, the key frame is uniformly sampled from the frames preceding the annotated frame in the video clip, and the prediction of the annotated frame is used for evaluation. No testing augmentation (e.g., multi-scale or multi-crop) is adopted. The accuracy is measured by mean Intersection-over-Union (mIoU), and the speed is measured by frames per second (FPS). Our models run on a server with an Intel Core i9-7920X CPU and a single NVIDIA GeForce GTX 1080 Ti GPU card.
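For reference, mIoU averages the per-class intersection-over-union over the evaluated classes. A minimal per-image implementation (the benchmark protocol additionally accumulates a confusion matrix over the whole dataset and handles an ignore label):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```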

5.2 Ablation Study

We start building TWNet from the training of the per-frame model. We adopt the commonly used lightweight CNN, MobileNetV1, as the encoder. Our per-frame model achieves 73.6% mIoU at 35.5 FPS.

The non-key-frame CNN. As described in Sec. 4.2, in the non-key-frame CNN, the layer of feature maps used for interior warping can be arbitrarily chosen to balance accuracy and speed. We separately choose three layers in the decoder as the context features. Results are summarized in Table 1.

According to the experimental results, fine-tuning (the second training step) significantly improves the performance. This suggests that low-level spatial features are more discriminative in the non-key-frame CNN, possibly because the warped context features are less reliable, so the model depends more on the low-level spatial features.

Warping Layer  Fine-tuned  mIoU  FPS
Layer 1                    67.3  65.5
Layer 1        ✓           69.6  65.5
Layer 2                    65.4  89.8
Layer 2        ✓           67.8  89.8
Layer 3        -           63.2  119.7
Table 1: Performance comparison of different layers used for interior warping. Warping Layer: the layer of the feature map used for interior warping; Fine-tuned: whether the second training step is performed to fine-tune the non-key-frame CNN. If not, the parameters of the head and tail layers keep the same as those in the per-frame CNN. If Layer 3 is chosen, there exist no trainable parameters and hence no fine-tuning.
Warping Layer  $\lambda_2$  mIoU  FPS
Layer 1        0            69.9  63.1
Layer 1        1            70.2  63.1
Layer 1        10           70.6  63.1
Layer 1        20           70.3  63.1
Layer 2        0            67.6  86.3
Layer 2        1            68.1  86.3
Layer 2        10           68.6  86.3
Layer 2        20           68.3  86.3
Table 2: Validation of $\lambda_2$ during the training of CFR. $\lambda_2$: the weight of the consistency loss $\mathcal{L}_{con}$.

CFR module and consistency loss. We propose the CFR module to correct the warped context features. As shown in Table 2, the CFR module is effective and efficient. Table 2 also demonstrates the effectiveness of the consistency loss $\mathcal{L}_{con}$ and its weight $\lambda_2$, a crucial hyper-parameter for the training of CFR. By default, we set $\lambda_2$ to 10 in the following sections for better performance.

RGA module. We introduce the RGA module to further exploit the correlation between residuals in image space and feature space. Results are demonstrated in Table 3. As expected, the RGA module further improves the performance of TWNet since it guides the CFR module to pay more attention to error-prone regions. The qualitative results of TWNet on Cityscapes are shown in Fig. 7.

Warping Layer  FT  CFR  RGA  mIoU  FPS
Layer 1                      67.3  65.5
Layer 1        ✓             69.6  65.5
Layer 1        ✓   ✓         70.6  63.1
Layer 1        ✓   ✓    ✓    71.6  61.8
Layer 2                      65.4  89.8
Layer 2        ✓             67.8  89.8
Layer 2        ✓   ✓         68.6  86.3
Layer 2        ✓   ✓    ✓    69.5  84.9
Table 3: Performance comparison of each module in TWNet. FT: the fine-tuning of the non-key-frame CNN (the second training step); CFR: Context Feature Rectification; RGA: Residual-Guided Attention. "✓" means the model utilizes the corresponding module.
Figure 7: Qualitative results on the Cityscapes dataset. GT: ground truth; Warping: normal warping; NKFC: the non-key-frame CNN; CFR: context feature rectification; RGA: residual-guided attention.

Category-level improvement. The IoU improvements for different categories are shown in Table 4. The IoUs of non-rigid categories (human, object, and vehicle) are improved greatly for the following reason: the motions of these moving objects are hard to predict, so the warped results are inaccurate, and our correction modules fix the wrong predictions and thus improve the accuracy.

Method object human vehicle nature construction sky flat
Warping 43.8 56.7 82.2 86.6 87.1 91.6 96.6
NKFC (The non-key-frame CNN) 51.2 (+7.4) 65.5 (+8.8) 84.6 (+2.4) 89.6 (+3.0) 89.1 (+2.0) 94.0 (+2.4) 97.3 (+0.7)
NKFC + CFR 62.2 (+18.4) 75.2 (+18.5) 89.7 (+7.5) 91.3 (+4.7) 90.8 (+3.7) 94.2 (+2.6) 97.9 (+1.3)
NKFC + CFR + RGA 62.2 (+18.4) 76.1 (+19.4) 90.1 (+7.9) 91.1 (+4.5) 91.0 (+3.9) 94.2 (+2.6) 98.0 (+1.4)
Table 4: IoU improvements of different categories. We choose Layer 1 here.
Figure 8: Performance degradation of Warp and Warp-Correct. (a): Layer 1 used for interior warping. (b): Layer 2 used for warping. $\Delta$: the frame interval between the key frame and the frame to be evaluated. The correction module effectively alleviates the long-term error accumulation problem.
Model  train set  eval set  resolution  mIoU-pf  mIoU  FPS-pf  FPS  FPS norm GPU
Per-frame Models
ICNet [33] train val 67.7 - 30.3 - 49.7 TITAN X(M)
ERFNet [20] train test 69.7 - 11.2 - 18.4 TITAN X(M)
SwiftNetRN-18 [19] train val 74.4 - 34.0 - 34.0 1080 Ti
CAS [32] train val 74.0 - 34.2 - 48.9 1070
Video-based Models
DFF [36] train val 71.1 69.2 1.52 5.6 12.8 Tesla K40
DVSNet1 [28] train val 73.5 63.2 5.6 30.4 30.4 1080 Ti
DVSNet2 [28] train val 73.5 70.4 5.6 19.8 19.8 1080 Ti
Prop-mv [11] train val 75.2 61.7 1.3 7.6 9.6 Tesla K80
Interp-mv [11] train val 75.2 66.6 1.3 7.2 9.1 Tesla K80
Low-Latency [16] train val 80.2 75.9 2.8 8.4 - -
TWNet-Layer1 train val 73.6 71.6 35.5 61.8 61.8 1080 Ti
TWNet-Layer2 train val 73.6 69.5 35.5 84.9 84.9
Table 5: Comparison of state-of-the-art semantic video segmentation models on Cityscapes. Terms with "-pf": mIoU/FPS of the per-frame model; "FPS norm" is calculated based on the relative performance of the GPUs. For previous works, we report the results in the evaluation settings most comparable to ours.

Error accumulation. We also conduct experiments to show that TWNet is able to alleviate the error accumulation problem during consecutive warping. Let $\Delta$ denote the frame-level interval between the initial key frame and the frame to be evaluated. We set $\Delta$ to different values and evaluate the performance of TWNet and of the non-key-frame CNN without the correction modules. Results in Fig. 8 show that the correction modules significantly alleviate the accuracy degradation and improve the robustness of the models. Meanwhile, the employment of CFR and RGA takes little extra time.

5.3 Comparison with Other Methods

We compare the proposed TWNet with other state-of-the-art semantic video segmentation models, as shown in Table 5. Note that the speed (FPS) values measured for different models are listed for reference only, since the experimental environments may vary a lot. All of our models run on a platform with CUDA 9.2, cuDNN 7.3 and TensorFlow 1.12, and we use the timeline tool in TensorFlow to measure the speed. Following the recent work of [19], we include the "FPS norm" value based on the GPU types of previous methods. Results demonstrate that our TWNet achieves the highest inference speed with comparable accuracy on high-resolution inputs. Also, the accuracy of TWNet degrades less from its per-frame baseline than that of other video-based methods. It is worth noting that we adopt a lightweight key-frame CNN to simplify our presentation. If a more delicate per-frame model were used, the performance of TWNet would be further improved.

5.4 Results on the CamVid Dataset

We also conduct experiments on the CamVid dataset, which contains 367, 100, and 233 video clips for training, validation and testing, respectively. We apply the same configurations as those of Cityscapes except for the crop size. Results are shown in Table 6, showing that TWNet generalizes to different datasets. For comparison, our per-frame model runs at 103.5 FPS.

Warping Layer  FT  CFR  RGA  mIoU  FPS
Layer 1                      68.8  183.1
Layer 1        ✓             69.9  183.1
Layer 1        ✓   ✓         71.0  179.8
Layer 1        ✓   ✓    ✓    71.5  175.2
Layer 2                      66.7  252.6
Layer 2        ✓             68.1  252.6
Layer 2        ✓   ✓         69.3  245.8
Layer 2        ✓   ✓    ✓    70.0  240.7
Table 6: Performances on the CamVid test set.

6 Conclusion

We present TWNet, a novel framework for real-time high-resolution semantic video segmentation. Specifically, we use warping and employ the non-key-frame CNN architecture for acceleration. To alleviate the errors caused by feature warping, we propose two efficient modules, namely CFR and RGA, to correct the warped features by learning the feature-space residual. Experimental results demonstrate that our method is much more robust than previous warping-based approaches while keeping a high inference speed.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §5.1.
  • [2] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2481–2495. Cited by: §1.
  • [3] G. J. Brostow, J. Fauqueur, and R. Cipolla (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97. Cited by: §1, §5.1.
  • [4] H. Caesar, J. Uijlings, and V. Ferrari (2018) COCO-stuff: thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1209–1218. Cited by: §5.1.
  • [5] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision, pp. 801–818. Cited by: §2.
  • [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: §1, §5.1.
  • [7] R. Gadde, V. Jampani, and P. V. Gehler (2017) Semantic video cnns through representation warping. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4453–4462. Cited by: §1, §2.
  • [8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1, §5.1.
  • [9] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) CCNet: criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 603–612. Cited by: §2.
  • [10] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2462–2470. Cited by: §2.
  • [11] S. Jain and J. E. Gonzalez (2018) Fast semantic segmentation on video using block motion-based feature interpolation. In European Conference on Computer Vision, pp. 3–6. Cited by: §1, §2, Table 5.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • [13] D. Le Gall (1991) MPEG: a video compression standard for multimedia applications. Communications of the ACM 34 (4), pp. 46–59. Cited by: §3.
  • [14] H. Li, P. Xiong, H. Fan, and J. Sun (2019) DFANet: deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9522–9531. Cited by: §1, §2.
  • [15] X. Li, Y. Zhou, Z. Pan, and J. Feng (2019) Partial order pruning: for best speed/accuracy trade-off in neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9145–9153. Cited by: §1.
  • [16] Y. Li, J. Shi, and D. Lin (2018) Low-latency video semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5997–6005. Cited by: §1, §2, Table 5.
  • [17] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: Figure 2, §2, §4.1, §4.2, §5.1.
  • [18] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §2.
  • [19] M. Orsic, I. Kreso, P. Bevandic, and S. Segvic (2019) In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12607–12616. Cited by: §1, §2, §5.3, Table 5.
  • [20] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo (2017) ERFNet: efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 19 (1), pp. 263–272. Cited by: Table 5.
  • [21] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §2, §4.1, §4.2.
  • [22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §5.1.
  • [23] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §1.
  • [24] Z. Shou, X. Lin, Y. Kalantidis, L. Sevilla-Lara, M. Rohrbach, S. Chang, and Z. Yan (2019) DMC-Net: generating discriminative motion cues for fast compressed video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1268–1277. Cited by: §4.4.
  • [25] A. Sofokleous (2005) H.264 and MPEG-4 video compression: video coding for next-generation multimedia. Oxford University Press. Cited by: §3, §4.4.
  • [26] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943. Cited by: §2.
  • [27] C. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and P. Krähenbühl (2018) Compressed video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6026–6035. Cited by: §4.4.
  • [28] Y. Xu, T. Fu, H. Yang, and C. Lee (2018) Dynamic video segmentation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6556–6565. Cited by: §1, §2, Table 5.
  • [29] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) BiSeNet: bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision, pp. 325–341. Cited by: §1, §2.
  • [30] C. Zach, T. Pock, and H. Bischof (2007) A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, pp. 214–223. Cited by: §2.
  • [31] H. Zhang, H. Zhang, C. Wang, and J. Xie (2019) Co-occurrent features in semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 548–557. Cited by: §2.
  • [32] Y. Zhang, Z. Qiu, J. Liu, T. Yao, D. Liu, and T. Mei (2019) Customizable architecture search for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11641–11650. Cited by: §1, Table 5.
  • [33] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia (2018) ICNet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision, pp. 405–420. Cited by: §1, §2, Table 5.
  • [34] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §2.
  • [35] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641. Cited by: §5.1.
  • [36] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei (2017) Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349–2358. Cited by: §1, §2, Table 5.