Official pytorch implementation for "Video Panoptic Segmentation" (CVPR 2020 Oral)
Panoptic segmentation has become a new standard of visual recognition task by unifying previous semantic segmentation and instance segmentation tasks in concert. In this paper, we propose and explore a new video extension of this task, called video panoptic segmentation. The task requires generating consistent panoptic segmentation as well as an association of instance ids across video frames. To invigorate research on this new task, we present two types of video panoptic datasets. The first is a re-organization of the synthetic VIPER dataset into the video panoptic format to exploit its large-scale pixel annotations. The second is a temporal extension on the Cityscapes val. set, by providing new video panoptic annotations (Cityscapes-VPS). Moreover, we propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames. To provide appropriate metrics for this task, we propose a video panoptic quality (VPQ) metric and evaluate our method and several other baselines. Experimental results demonstrate the effectiveness of the presented two datasets. We achieve state-of-the-art results in image PQ on Cityscapes and also in VPQ on Cityscapes-VPS and VIPER datasets. The datasets and code are made publicly available.
In an effort to unify the existing recognition tasks of object classification, detection, and segmentation, and to leverage their possible complementarity within a single complete task, Kirillov et al. proposed a holistic segmentation of all foreground instances and background regions in a scene and named the task panoptic segmentation. Since then, a large number of works [19, 15, 20, 37, 24, 10, 17, 18, 31, 7, 8, 40] have proposed learning-based approaches to this new benchmark task, confirming its importance to the field.
In this paper, we extend panoptic segmentation from the image domain to the video domain. Different from image panoptic segmentation, the new problem aims at a simultaneous prediction of object classes, bounding boxes, masks, instance id associations, and semantic segmentation, while assigning a unique answer to each pixel in a video. Figure 1 illustrates sample video sequences of ground truth annotations for this problem. Naturally, we name the new task video panoptic segmentation (VPS). The new task opens up possibilities for applications that require a holistic and global view of video segmentation such as autonomous driving, augmented reality, and video editing. In particular, temporally dense panoptic segmentation of a video can work as an intermediate-level representation for even higher-level video understanding tasks such as temporal reasoning or actor-action recognition, which anticipates the behaviors of objects and humans. To the best of our knowledge, this is the first work to address the video panoptic segmentation problem.
Image panoptic segmentation has successfully driven active participation from the community. However, the direction towards the video domain has not yet been explored, probably due to the lack of appropriate datasets and evaluation metrics. While video object/instance segmentation datasets are available these days, no dataset permits direct training of video panoptic segmentation (VPS). This is not surprising considering the extremely high cost of collecting such data. To improve the situation, we make an important first step in the direction of video panoptic segmentation by presenting two types of datasets. First, we adapt the synthetic VIPER dataset into the video panoptic format and create corresponding metadata. Second, we collect a new video panoptic segmentation dataset, named Cityscapes-VPS, that extends the public Cityscapes to a video level by providing every five video frames with pixel-level panoptic labels that are temporally associated with respect to the public image-level annotations.
In addition, we propose a video panoptic segmentation network (VPSNet) to provide a baseline method for this new task. On top of UPSNet, which is a state-of-the-art method for image panoptic segmentation, we design our VPSNet to take an additional frame as the reference to correlate temporal information at two levels: pixel-level fusion and object-level tracking. To pick up complementary feature points in the reference frame, we propose a flow-based feature map alignment module along with an asymmetric attention block that computes similarities between the target and reference features to fuse them into a single frame. Moreover, to associate object instances across time, we add an object track head which learns the correspondence between the instances in the target and reference frames based on their RoI feature similarity. It establishes a baseline for the VPS task and gives us insights into the main algorithmic challenges it presents.
We adapt the standard image panoptic quality (PQ) measure to fit the video panoptic quality (VPQ) format. Specifically, the metric is obtained from a span of several frames, where the sequence of each panoptic segment within the span is considered a single 3D tube prediction to produce an IoU with the ground truth tube. The longer the time-span, the more challenging it is to obtain IoU over a threshold and to be counted as a true-positive for the final VPQ score. We evaluate our proposed method with several other naive baselines using the VPQ metric.
Experimental results demonstrate the effectiveness of the two presented datasets. Our VPSNet achieves state-of-the-art image PQ on Cityscapes and VIPER. More importantly, in terms of VPQ, it outperforms the strong baseline  and other simple candidate methods, while still presenting algorithmic challenges of the VPS task.
We summarize the contribution of this paper as follows.
To the best of our knowledge, this is the first time that video panoptic segmentation (VPS) is formally defined and explored.
We present the first VPS datasets by re-formatting the virtual VIPER dataset and creating new video panoptic labels based on the Cityscapes benchmark. Both datasets are complementary in constructing an accurate VPS model.
We propose a novel VPSNet which achieves state-of-the-art image panoptic quality (PQ) on Cityscapes and VIPER, and compare it with several baselines on our new datasets.
We propose a video panoptic quality (VPQ) metric to measure the spatial-temporal consistency of predicted and ground truth panoptic segmentation masks. The effectiveness of our proposed datasets and methods is demonstrated by the VPQ evaluation.
The joint task of thing and stuff segmentation was reinvented by Kirillov et al. in the form of combining the semantic segmentation and instance segmentation tasks, and is named panoptic segmentation. Since then, much research [19, 15, 20, 37, 24, 10, 17, 18, 31, 7, 8, 40] has actively proposed new approaches to this unified task, which is now a de facto standard visual recognition task. A naive baseline is to train the two sub-tasks separately and then fuse the results by heuristic rules. More advanced approaches to this problem present a unified, end-to-end model. Li et al. propose AUNet, which leverages mask-level attention to transfer knowledge from the instance head to the semantic head. Li et al. suggest a new objective function to enforce consistency between things and stuff pixels when merging them into a single segmentation result. Liu et al. design a spatial ranking module to address the occlusion between predicted instances. Xiong et al. introduce a non-parametric panoptic head to predict instance ids and resolve the conflicts between things and stuff segmentation.
As a direct extension of semantic segmentation to videos, all pixels in a video are predicted as different semantic classes. However, research in this field has not gained much attention and is currently less popular compared to its counterpart in the image domain. One possible reason is the lack of available training data with temporally dense annotation, as research progress depends greatly on the existence of datasets. Despite the absence of a dataset for Video Semantic Segmentation (VSS), several approaches have been proposed in the literature [21, 33, 43, 26, 14]. Temporal information is utilized via optical flow to improve the accuracy or efficiency of scene labeling. Different from our setting, VSS requires neither discriminating object instances nor explicitly tracking objects across frames. Our new Cityscapes-VPS is a super-set of a VSS dataset and is thus able to benefit this independent field as well.
Even more recently, Yang et al. proposed a Video Instance Segmentation (VIS) problem to extend image instance segmentation to videos. It combines several existing tasks: video object segmentation [1, 3, 4, 30, 35, 39, 36, 27] and video object detection [9, 42, 43], and aims at simultaneous detection, segmentation, and tracking of instances in videos. They propose MaskTrack R-CNN, which has a tracking branch added to Mask R-CNN to jointly learn these multiple tasks. The object association is trained based on object feature similarity learning, and the learned features are used together with other cues such as spatial correlation and detection confidence to track the objects at inference. The first difference to our setting is that VIS only deals with foreground thing objects but not background stuff regions. Moreover, the problem permits overlaps between predicted object masks and even multiple predictions for a single instance, while our task requires algorithms to assign a single label to all things and stuff pixels. Last but not least, their dataset contains a small number of objects per frame, whereas we deal with a much larger number of objects on average, which makes our task even more challenging.
For a video sequence with T frames, we set a temporal window that spans k additional consecutive frames. Given a k-span snippet {I_t, …, I_{t+k}}, we define a tube prediction as a track of its frame-level segments sharing a tuple (c, z), for semantic class c and instance id z of the tube. Note that the instance id z for a thing class can be larger than 0, e.g., car-0, car-1, …, whereas it is always 0 for a stuff class, e.g., sky. All pixels in the video are grouped by such tuple predictions, and they will result in a set of stuff and things video tubes that are mutually exclusive to each other. The ground truth tube is defined similarly, with a slight adjustment concerning the annotation frequency as described below. The goal of video panoptic segmentation is to accurately localize all the semantic and instance boundaries throughout a video and assign correct labels (c, z) to those segmented video tubes.
By the construction of the VPS problem, no overlaps are possible among video tubes. Thus, AP metric used in object detection or segmentation cannot be used to evaluate the VPS task. Instead, we borrow the panoptic quality (PQ) metric in image panoptic segmentation with modifications adapted to our new task.
Given a snippet with window size k, we denote the sets of ground truth and predicted tubes as U and Û. The set of true positive matches is defined as TP = {(u, û) ∈ U × Û : IoU(u, û) > 0.5}. False positives (FP) and false negatives (FN) are defined accordingly. When the annotation is given every λ frames, the matching only considers the annotated frame indices in a snippet, e.g., when k = 10 and λ = 5, frames t, t+5 and t+10 are considered. We slide the k-span window with a stride λ throughout a video, starting from frame 0 to the end, i.e., t goes by 0, λ, 2λ, … (we assume frame 0 is annotated). Each stride constructs a new snippet, where we compute the IoUs, TP, FP and FN as above.
At the dataset level, the snippet-level IoU, TP, FP and FN values are collected across all predicted videos. Then, the dataset-level VPQ metric is computed for each class c and averaged across all classes as

VPQ^k = (1/N_classes) Σ_c [ Σ_{(u,û)∈TP_c} IoU(u, û) ] / [ |TP_c| + ½|FP_c| + ½|FN_c| ],

where the ½|FP_c| + ½|FN_c| term in the denominator is to penalize unmatched tubes, as suggested in the image PQ metric.
By definition, k = 0 makes the metric equivalent to the image PQ metric, and k = T − 1 constructs a set of whole video-long tubes. Any cross-frame inconsistency of semantic or instance label prediction will result in a low tube IoU, and may drop the match out of the TP set, as illustrated in Figure 2. Therefore, the larger the window size, the more challenging it is to get a high VPQ score. In practice, we include different window sizes to provide a more comprehensive evaluation. The final VPQ is computed by averaging over K = 4 window sizes, k ∈ {0, 5, 10, 15}, as VPQ = (1/K) Σ_k VPQ^k.
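To make the metric concrete, the snippet-level matching and per-class scoring described above can be sketched as follows. This is a minimal NumPy illustration, not the official evaluation code: the tube representation (a dict keyed by (class, instance id) tuples of per-frame boolean masks) and the greedy matching are simplifying assumptions, relying on the fact that a tube IoU > 0.5 match is unique when segments do not overlap.

```python
import numpy as np

def tube_iou(pred_masks, gt_masks):
    # A tube is a list of per-frame boolean masks; IoU is computed over
    # the whole 3D tube (summing intersections/unions across frames).
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    return inter / union if union > 0 else 0.0

def vpq_snippet(pred_tubes, gt_tubes, num_classes):
    # pred_tubes, gt_tubes: dict mapping (class_id, instance_id) -> tube
    iou_sum = np.zeros(num_classes)
    tp = np.zeros(num_classes)
    fp = np.zeros(num_classes)
    fn = np.zeros(num_classes)
    matched_gt = set()
    for key, p_masks in pred_tubes.items():
        c = key[0]
        # Greedy best match; with mutually exclusive segments, any match
        # with tube IoU > 0.5 is unique, as in the image PQ metric.
        best = max(((gk, tube_iou(p_masks, gm))
                    for gk, gm in gt_tubes.items() if gk[0] == c),
                   key=lambda x: x[1], default=(None, 0.0))
        if best[1] > 0.5:
            tp[c] += 1
            iou_sum[c] += best[1]
            matched_gt.add(best[0])
        else:
            fp[c] += 1
    for gk in gt_tubes:
        if gk not in matched_gt:
            fn[gk[0]] += 1
    per_class = []
    for c in range(num_classes):
        denom = tp[c] + 0.5 * fp[c] + 0.5 * fn[c]
        if denom > 0:  # classes absent from both sets are skipped
            per_class.append(iou_sum[c] / denom)
    return float(np.mean(per_class)) if per_class else 0.0
```

A perfectly consistent prediction yields VPQ 1.0; a tube whose IoU falls to 0.5 or below counts as both a false positive and a false negative, driving the score toward 0.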
Having different k values enables a smooth transition from the existing image PQ evaluation to videos, encouraging image-to-video technical developments in this pioneering field to leap forward.
We set k as a user-defined parameter. Having such a fixed temporal window size regularizes the difficulty of IoU matching across video samples of different lengths. In contrast, the difficulty of matching whole video-long tubes varies greatly with the video length, e.g., between T = 10 and T = 1000.
We empirically observed that, in our Cityscapes-VPS dataset (λ = 5), many object associations are disconnected by significant scene changes when the temporal window grows large. Given a new annotation frequency 1/λ, the k values shall be reset accordingly, which will set a level of difficulty for the dataset.
There are several public datasets which have dense panoptic segmentation annotations: Cityscapes, ADE20k, Mapillary, and COCO. However, none of these datasets meets the requirements of our video panoptic segmentation task. Thus, we need to prepare a suitable dataset for the development and evaluation of video panoptic segmentation methods. We pursue several directions when collecting VPS datasets. First, both the quality and quantity of the annotation should be high; the former is a common problem in some of the existing polygon-based segmentation datasets, and the latter is limited by the extreme cost of panoptic annotations. More importantly, it should be easily adaptable to and extensible from the existing image-based panoptic datasets, so that it can encourage the research community to seamlessly transfer knowledge between the image and video domains. With the above directions in mind, we present two VPS datasets by 1) reformatting the VIPER dataset and 2) creating new video panoptic annotations based on the Cityscapes dataset.
To maximize both the quality and quantity of the available annotations for the VPS task, we take advantage of the synthetic VIPER dataset extracted from the GTA-V game engine. It includes pixel-wise annotations of semantic and instance segmentations for 10 thing and 13 stuff classes on 254K frames of ego-centric driving scenes. As shown in Figure 1 (top row), we tailor their annotations into our VPS format and create metadata in the popular COCO style, so that it can be seamlessly plugged into recent recognition models such as Mask R-CNN.
(Table 1: high-level statistics — numbers of annotated instances and masks — for VIPER, Cityscapes-VPS, and related datasets.)
Instead of building our dataset from scratch in isolation, we build our benchmark on top of the public Cityscapes dataset, which, together with COCO, is the most popular dataset for panoptic segmentation. It consists of image-level annotated frames of ego-centric driving scenarios, where each labeled frame is the 20th frame of a 30-frame video snippet. There are 2975, 500, and 1525 such sampled images paired with dense panoptic annotations for 8 thing and 11 stuff classes, for training, validation, and testing, respectively. Specifically, we select the validation set to build our own video-level extended dataset. We select every five frames from each of the 500 videos, i.e., the 5, 10, 15, 20, 25, and 30-th frames, where the 20-th frame already has the original Cityscapes panoptic annotations. For the other 5 frames, we ask expert turkers to carefully label each pixel with all 19 classes and instance ids consistent with the 20-th frame as reference. They are also asked to maintain a similar level of pixel quality, as shown in Figure 1 (bottom row). Our resulting dataset provides an additional 2,500 frames of dense panoptic labels that temporally extend the 500 frames of the Cityscapes labels. The new benchmark is referred to as Cityscapes-VPS.
Our new dataset Cityscapes-VPS is not only the first benchmark for video panoptic segmentation but also a useful benchmark for other vision tasks such as video instance segmentation and video semantic segmentation; the latter has also been suffering from the lack of a well-established video benchmark. We show some high-level statistics of the reformatted VIPER, the new Cityscapes-VPS, and related datasets in Table 1.
Unlike static images, videos have rich temporal and motion context, and a VPS model should faithfully use this information to capture the panoptic movement of all things and stuff classes in a video. We propose a video panoptic segmentation network (VPSNet). Given an input video sequence, VPSNet performs object detection, mask prediction, tracking, and semantic segmentation all simultaneously. This section describes our network architecture and its implementation in detail.
By the nature of the VPS task, temporal inconsistency in either the class label or the instance id will result in low video quality of the panoptic segmentation sequences. More strict requirements are therefore in place for the thing classes. With this consideration in mind, we design our VPSNet to use video context at two levels: pixel level and object level. The first is to leverage neighboring frame features for the downstream multi-task branches, and the second is to explicitly model cross-frame instance association specifically for tracking. Neither the feature fusion module nor the object tracking module is totally new in isolation, but they are jointly used for the first time for the task of video panoptic segmentation. We call them the Fuse and Track modules throughout the paper. The overall model architecture is shown in Figure 3.
We build upon an image-level panoptic segmentation network. While not being sensitive to any specific design of a baseline network, we choose the state-of-the-art method, UPSNet, which adopts Mask R-CNN and deformable convolutions for the instance and semantic segmentation branches respectively, with a panoptic head that combines these two branches. One of several modifications is that we do not use unknown class predictions, for the simplicity of the algorithm. Also, we have an extra non-parametric neck layer, which is inspired by Pang et al. They use balanced semantic features to enhance the pyramidal neck representations. Different from theirs, our main design purpose is to have a representative feature map itself at a single resolution level. For this reason, our extra neck consists of only gather and redistribute steps with no additional parameters. First, at the gather step, the input feature pyramid network (FPN) features are resized to the highest resolution, i.e., the same size as the largest FPN level, and element-wise summed over the multiple levels to produce a single representative feature. Then, this representative feature is redistributed to the original features by a residual addition.
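The parameter-free gather-and-redistribute neck above can be sketched as follows. This is a minimal NumPy illustration under assumptions: nearest-neighbor resizing stands in for the interpolation a real implementation would use, feature sizes are assumed to be exact powers-of-two multiples, and the function name is illustrative, not from the released code.

```python
import numpy as np

def gather_redistribute(fpn_feats):
    """Parameter-free extra neck: gather all FPN levels into one
    representative map at the highest resolution, then redistribute it
    back to every level by residual addition.
    fpn_feats: list of arrays (C, H_i, W_i), highest resolution first."""
    C, H, W = fpn_feats[0].shape
    gathered = np.zeros((C, H, W), dtype=fpn_feats[0].dtype)
    for f in fpn_feats:
        # Gather step: nearest-neighbor upsample each level to the
        # highest resolution and element-wise sum.
        ry, rx = H // f.shape[1], W // f.shape[2]
        gathered += np.repeat(np.repeat(f, ry, axis=1), rx, axis=2)
    out = []
    for f in fpn_feats:
        # Redistribute step: downsample the representative map back to
        # each level's resolution and add it residually.
        ry, rx = H // f.shape[1], W // f.shape[2]
        out.append(f + gathered[:, ::ry, ::rx])
    return out
```

Because both steps are pure resizing and addition, the neck adds no learnable parameters while letting every pyramid level see a shared, multi-scale summary.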
The main idea is to leverage video context to improve the per-frame feature by temporal feature fusion. At each time step t, the feature extractor is given a target frame I_t and one (or more) reference frame(s) I_{t−τ}, and produces FPN features for each. We sample the reference frame from the frames neighboring the target.
We propose an align-and-attend pipeline in between the gather and redistribute steps. Given the gathered features of the target and reference frames, our align module learns flow warping to align the reference feature onto the target feature. The align module receives an initial optical flow computed by FlowNet2, and refines it for a more accurate deep feature flow. After concatenating the aligned features, our attend module learns spatial-temporal attention to reweight the features and fuse the time dimension into one, producing a fused feature that is then redistributed to the FPN levels and fed forward to the downstream instance and semantic branches.
Here, the goal is to track all object instances in the target frame I_t with respect to those in the reference frame I_{t−τ}. Along with the multi-task heads for panoptic segmentation, we add the MaskTrack head, which is used in a state-of-the-art video instance segmentation method. It learns a feature affinity matrix between the RoI proposals generated from I_t and the RoI features from I_{t−τ}. For each pair, a Siamese fully-connected layer embeds the two RoI features into single vectors, and then their cosine similarity is measured.
MaskTrack is designed for still images, only utilizes appearance features, and does not use any video features during training. To handle this problem, we couple the tracking branch with the temporal fusion module. Specifically, all RoI features are first enhanced by the above temporally fused feature from multiple frames, and thus become more discriminative before being fed into the tracking branch. Therefore, from the standpoint of instance tracking, our VPSNet synchronizes it at both the pixel level and the object level. The pixel-level module aligns the local feature of an instance to transfer it between the reference and target frames, and the object-level module focuses more on distinguishing the target instance from other reference objects via the similarity function on the temporally augmented RoI features. During training, the tracking head in our VPSNet is the same as in MaskTrack. During the inference stage, we add an additional cue from the panoptic head: the IoU of things logits. The IoU of instance logits can be viewed as a deformation factor or spatial correlation between frames, and our experiments show that it improves the video panoptic quality for things classes.
We follow most of the settings and hyper-parameters of Mask R-CNN and other panoptic segmentation models such as UPSNet . Hereafter, we only explain those which are different. Throughout the experiments, we use ResNet-50 FPN [12, 22] as the feature extractor.
We implement our models in PyTorch with the MMDetection toolbox. We use the distributed training framework with 8 GPUs, with 1 image per GPU in each mini-batch. We use the ground truth box of a reference frame to train the track head. We crop random patches out of Cityscapes and VIPER images after randomly scaling each frame by 0.8 to 1.25×. Due to the high resolution of the images, we downsample the logits of the semantic and panoptic heads. Besides the RPN losses, our VPSNet contains 6 task-related loss functions in total: bbox head (classification and bounding-box), mask head, semantic head, panoptic head, and track head. We set all loss weights to 1.0 so that their scales are roughly on the same order of magnitude.
We set the learning rate and weight decay to 0.005 and 0.0001 for all datasets. For VIPER, we train for 12 epochs and apply learning rate decay at epochs 8 and 11. For both Cityscapes and Cityscapes-VPS, we train for 144 epochs and apply learning rate decay at epochs 96 and 128. For the pretrained models, we import COCO- or VIPER-pretrained base model parameters and initialize the remaining layers, e.g., the Fuse (align-and-attend) and Track modules, by Kaiming initialization.
Given a new testing video, our method processes each frame sequentially in an online fashion. At each frame, our VPSNet first generates a set of instance hypotheses. As a mask pruning process, we perform class-agnostic non-maximum suppression with a box IoU threshold of 0.5 to filter out redundant boxes. The remaining boxes are then sorted by their predicted class probabilities and kept if the probability is larger than 0.6. For the first frame of a video sequence, we assign instance ids according to the order of the probability. For all other frames, the remaining boxes after pruning are matched to identified instances from previous frames based on the learned affinity, and are assigned instance ids accordingly. After processing all frames, our method produces a sequence of panoptic segmentations, each pixel of which carries a unique category label and instance label throughout the sequence. For both IPQ and VPQ evaluation, we test all available models with single-scale testing.
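The id-assignment step at inference can be sketched as a greedy matching over the learned affinity, processing confident boxes first. The zero threshold, one-to-one constraint, and tie-breaking here are illustrative simplifications of the matching described above, not the exact released logic.

```python
import numpy as np

def track_ids(affinity, ref_ids, scores, new_id_start):
    """Assign instance ids to the surviving boxes of the current frame.
    affinity: (N, M) similarity of N current boxes to M previously
    identified instances; ref_ids: their ids; scores: (N,) class
    probabilities, used to process confident boxes first."""
    order = np.argsort(-np.asarray(scores))  # most confident first
    ids = [None] * len(scores)
    taken, next_id = set(), new_id_start
    for i in order:
        j = int(np.argmax(affinity[i]))
        if affinity[i, j] > 0 and ref_ids[j] not in taken:
            # Inherit the id of the most similar reference instance.
            ids[i] = ref_ids[j]
            taken.add(ref_ids[j])
        else:
            # Unmatched detection: start a new track.
            ids[i] = next_id
            next_id += 1
    return ids
```

In the full system the affinity row would also be reweighted by the extra cues mentioned above (spatial correlation of things logits and detection confidence) before taking the argmax.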
In this section, we present the experimental results on the two proposed video-level datasets, VIPER and Cityscapes-VPS, as well as the conventional image-level Cityscapes benchmark. In particular, we mainly investigate the results in two aspects: image-level prediction and cross-frame association, which will be reflected in the IPQ and VPQ, respectively. We demonstrate the contributions of each of the proposed pixel-level Fuse and object-level Track modules in the performance of video panoptic segmentation. Here is the information on the dataset splits used in experiments.
VIPER: Based on the high quantity and quality of its panoptic video annotation, we mainly experiment with this benchmark. We follow the public train/val split. For evaluation, we choose 10 validation videos from the day scenario, and use the first 60 frames of each video: 600 images in total.
Cityscapes: We use the public train / val split, and evaluate our image-level model on the validation set.
Cityscapes-VPS: The created video panoptic annotations are given with the 500 validation videos of Cityscapes. We further split these videos into 400 training, 50 validation, and 50 test videos. Each video consists of 30 consecutive frames, with every 5 frames paired with the ground truth annotations. For each video, all 30 frames are predicted, and only the 6 frames with the ground truth are evaluated.
One thing we may expect from VPS learning, compared to its image-level counterpart, is an improvement in per-frame PQ from properly utilizing spatial-temporal features. We evaluate our method with the existing panoptic quality (PQ), recognition quality (RQ), and segmentation quality (SQ) measures. The results are presented in Table 2 and Table 3.
First, we study the importance of the proposed Fuse and Track modules to our image-level panoptic segmentation performance on the VIPER dataset, as shown in Table 2. We find that the pixel-level and object-level modules make complementary contributions, each improving the baseline by +1% PQ. Without both of them, the PQ drops by 3.4%. The best PQ is achieved when the two modules are used together.
We also experiment on the Cityscapes benchmark, to provide a comparison with the state-of-the-art panoptic segmentation methods. Our VPSNet with only the Fuse module can be trained in this setting, since it only requires a neighboring reference frame without any extra annotations. In Table 3, we find that VPSNet with the Fuse module outperforms the state-of-the-art baseline method by +1.0% PQ, which implies that it effectively exploits spatial-temporal context to improve per-frame panoptic quality. Pretraining on the VIPER dataset is complementary to both COCO and Cityscapes, boosting the score by +1.6% PQ over our baseline and achieving 62.2% PQ. We also converted our results into the semantic segmentation format and achieved 79.0% mIoU.
We evaluate the spatial-temporal consistency between the predicted and ground truth panoptic video segmentation. The quantitative results are shown in Table 4 and Table 5. Different from image panoptic segmentation, our new task requires extra consistency in instance ids over time, which makes the problem much more challenging for things than stuff classes. Not surprisingly, the mean video panoptic quality of things classes (VPQTh) is generally lower than that of stuff classes (VPQSt).
Since there is no prior work directly applicable to our new task, we present several baseline VPS methods to provide a reference level. Specifically, we enumerate over different methods by replacing only the tracking branch of our VPSNet. The alternative tracking methods are object sorting by classification logit values (Cls-Sort), and flow-guided object matching by mask IoU (IoU-Match). First, Cls-Sort relies on the semantic consistency of the same object between frames. However, it fails to track objects, possibly because a frame contains many instances of the same class, e.g., car or person, making the class logit information insufficient for differentiating these instances. On the other hand, IoU-Match is a simple yet strong candidate method for our task; by leveraging spatial correlation to determine the instance labels, it improves the image-level baseline by +9.7% VPQ.
Our model with the Track module further improves this by +1.2% VPQ, by using the learned RoI feature matching together with the semantic consistency and spatial correlation cues. Our full model with both Fuse and Track modules achieves the best performance, a large improvement of +17.0% VPQ over the image-level base model and +6.1% VPQ over the Track-only variant. To isolate the contribution of the fused feature to the object matching performance, we experiment with a VPSNet variant where the fused feature is fed to all task branches except the tracking branch (Disjoined). The result implies that the Fuse and Track modules share information and synergize with each other to learn more discriminative features for both segmentation and tracking. We observed a consistent tendency on our Cityscapes-VPS dataset, where our full VPSNet (FuseTrack) achieves +1.1% VPQ over the Track variant. Figure 4 shows the qualitative results of our VPSNet on VIPER and Cityscapes-VPS.
We find several challenges still remaining for our new task. First, even the state-of-the-art video instance tracking algorithm and our VPSNet suffer a considerable performance drop as the temporal length increases. In the video context, possible improvements are expected to be made by handling a large number of instances and resolving overlaps between these objects, e.g., Figure 4 (4th row), through better modeling of temporal information [27, 43]. Second, our task is still challenging for stuff classes as well, considering that the window size of 15 frames represents less than a second of video. The mutual exclusiveness between things and stuff pixels could be further exploited to encourage semantic segmentation and instance segmentation to regularize each other.
We present a new task named video panoptic segmentation, with two types of associated datasets. The first adapts the synthetic VIPER dataset into our VPS format, providing maximal quantity and quality of panoptic annotations. The second is a new video panoptic segmentation benchmark, Cityscapes-VPS, which extends the popular image-level Cityscapes dataset. We also propose a new method, VPSNet, by combining a temporal feature fusion module and an object tracking branch with a single-frame panoptic segmentation network. Last but not least, we suggest a video panoptic quality measure for evaluation, to provide early explorations towards this task. We hope the new task and algorithm will drive research forward towards video understanding in the real world.
This work was in part supported by the Institute for Information & Communications Technology Promotion (2017-0-01772) grant funded by the Korea government. Dahun Kim was partially supported by Global Ph.D. Fellowship Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018H1A2A1062075).
The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2462–2470, 2017.