The ability to anticipate future events is a key factor towards developing intelligent behavior [sutton98book]. Video prediction has been studied as a proxy task towards pursuing this ability, which can capitalize on the huge amount of available unlabeled video to learn visual representations that account for object interactions and interactions between objects and the environment [mathieu16iclr]. Most work in video prediction has focused on predicting the RGB values of future video frames [mathieu16iclr, RanzatoSzlamBruna2014, srivastava15icml, kalchbrenner17icml].
Predictive models have important applications in decision-making contexts, such as autonomous driving, where rapid control decisions can be of vital importance [Shalev-ShwartzB16longterm, ShalevShwartzS16sample]. In such contexts, however, the goal is not to predict the raw RGB values of future video frames, but to make predictions about future video frames at a semantically meaningful level, e.g. in terms of presence and location of object categories in a scene. Luc et al. [luc17iccv] recently showed that for prediction of future semantic segmentation, modeling at the semantic level is much more effective than predicting raw RGB values of future frames, and then feeding these to a semantic segmentation model.
Although spatially detailed, semantic segmentation does not account for individual objects, but rather lumps them together by assigning them to the same category label. See, e.g., the pedestrians in Figure 1(c). Instance segmentation overcomes this shortcoming by additionally associating with each pixel an instance label, as shown in Figure 1(b). This additional level of detail is crucial for down-stream tasks that rely on instance-level trajectories, such as those encountered in control for autonomous driving. Moreover, ignoring the notion of object instances prohibits by construction any reasoning about object motion, deformation, etc. Including it in the model can therefore greatly improve its predictive performance, by keeping track of individual object properties, cf. Figure 1 (c) and (d).
Since the instance labels vary in number across frames, and do not have a consistent interpretation across videos, the approach of Luc et al. [luc17iccv] does not apply to this task. Instead, we build upon Mask R-CNN [he17iccv], a recent state-of-the-art instance segmentation model that extends an object detection system by associating with each object bounding box a binary segmentation mask of the object. In order to forecast the instance-level labels in a coherent manner, we predict the fixed-sized abstract convolutional features used by Mask R-CNN. We obtain the future object instance segmentation by applying the Mask R-CNN “detection head” to the predicted features.
Our approach offers several advantages: (i) we handle cases in which the model output has a variable size, as in object detection and instance segmentation, (ii) we do not require labeled video sequences for training, as the intermediate CNN feature maps can be computed directly from unlabeled data, and (iii) we support models that are able to produce multiple scene interpretations, such as surface normals, object bounding boxes, and human part labels [kokkinos17cvpr], without having to design appropriate encoders and loss functions for all these tasks to drive the future prediction.
Our contributions are the following:
the introduction of the new task of future instance prediction, which is semantically richer than previously studied anticipated recognition tasks,
a self-supervised approach based on predicting high dimensional CNN features of future frames, which can support many anticipated recognition tasks,
experimental results that show that our feature learning approach improves over strong optical flow baselines.
2 Related Work
Future video prediction.
Predictive modeling of future RGB video frames has recently been studied using a variety of techniques, including autoregressive models [kalchbrenner17icml], adversarial training [mathieu16iclr], and recurrent networks [RanzatoSzlamBruna2014, srivastava15icml, villegas17iclr]. Villegas et al. [villegas17icml] predict future human poses as a proxy to guide the prediction of future RGB video frames. Instead of predicting RGB values, Walker et al. [walker16eccv] predict future pixel trajectories from static images.
Future prediction of more abstract representations has been considered in a variety of contexts in the past. Lan et al. [lan14eccv] predict future human actions from automatically detected atomic actions. Kitani et al. [kitani12eccv] predict future trajectories of people from semantic segmentation of an observed video frame, modeling potential destinations and transitory areas that are preferred or avoided. Lee et al. predict future object trajectories from past object tracks and object interactions [lee17cvpr]. Dosovitskiy & Koltun [dosovitskiy17iclr] learn control models by predicting future high-level measurements in which the goal of an agent can be expressed from past video frames and measurements.
Vondrick et al. [vondrick16cvpr] were the first to predict abstract CNN features of future video frames to anticipate actions and object appearances in video. Their work is similar in spirit to ours, but where they only predict image-level labels, we consider the more complex task of predicting spatially detailed future instance segmentations. To this end, we forecast spatially dense convolutional features, where Vondrick et al. were predicting the activations of more compact fully connected CNN layers.
Luc et al. [luc17iccv] predicted future semantic segmentations in video by taking the softmax pre-activations of past frames as input, and predicting the softmax pre-activations of future frames. While their approach is relevant for future semantic segmentation where the softmax pre-activations provide a natural fixed-sized representation, it does not extend to the case of instance segmentation since the instance-level labels vary in number between frames and are not consistent across video sequences. To overcome this limitation, we develop predictive models for fixed-sized convolutional features, instead of making predictions directly in the label space. In a direction orthogonal to our work, Jin et al. [jin17nips] jointly predict semantic segmentation and optical flow of future frames, leveraging the complementarity between the two tasks.
Instance segmentation approaches. Our approach can be used in conjunction with any deep network to perform instance segmentation. A variety of approaches for instance segmentation has been explored in the past, including iterative object segmentation using recurrent networks [romera16eccv], watershed transformation [bai17cvpr], and object proposals [pinheiro16eccv]. In our work we build upon Mask R-CNN [he17iccv], which recently established a new state-of-the-art for instance segmentation. This method extends the Faster R-CNN object detector [ren15nips] by adding a network branch to predict segmentation masks and extracting features for prediction in a way that allows precise alignment of the masks when they are stitched together to form the final output.
3 Predicting Features for Future Instance Segmentations
In this section we briefly review the Mask R-CNN instance segmentation framework, and then present how we can use it for anticipated recognition by predicting internal CNN features for future frames.
3.1 Instance Segmentation with Mask R-CNN
The Mask R-CNN model [he17iccv] consists of three main stages. First, a convolutional neural network (CNN) “backbone” architecture is used to extract high level feature maps. Second, a region proposal network (RPN) takes these features to produce regions of interest (RoIs), in the form of coordinates of bounding boxes likely to contain instances. The bounding box proposals are used as input to a RoIAlign layer, which interpolates the high level features in each bounding box to extract a fixed-sized representation for each box, regardless of its size. Third, the features of each RoI are input to the detection branches, which produce refined bounding box coordinates, a class prediction, and a fixed-sized binary mask for the predicted class. Finally, the mask is interpolated back to full image resolution within the predicted bounding box and reported as an instance segmentation for the predicted class. We refer to the combination of the second and third stages as the “detection head”. The full model is trained end-to-end from images with pre-segmented object instances.
He et al. [he17iccv] use a feature pyramid network (FPN) [lin17cvpr] as backbone architecture, which extracts a set of features at several spatial resolutions from an input image. The feature pyramid is then used in the instance segmentation pipeline to detect objects at multiple scales, by running the detection head on each level of the pyramid. Following [lin17cvpr], we denote the feature pyramid levels extracted from an RGB image $X$ by $P_2$ through $P_5$, which are of decreasing resolution $(H/2^l \times W/2^l)$ for $P_l$, where $H$ and $W$ are respectively the height and width of $X$. The features in $P_l$ are computed in a top-down stream by up-sampling those in $P_{l+1}$ and adding the result of a $1 \times 1$ convolution of features in a layer with matching resolution in a bottom-up ResNet stream. We refer the reader to the left panel of Figure 2 for a schematic illustration, and to [he17iccv, lin17cvpr] for more details.
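As a rough illustration of the top-down pathway described above, the following sketch computes the spatial size of a pyramid level and performs one merge step on nested lists laid out as [channel][row][column]. It is a simplified toy rather than the FPN implementation: the helper names, the nearest-neighbour up-sampling, and the example weights are all assumptions made for illustration.

```python
def level_resolution(height, width, level):
    """Spatial size of pyramid level `level` for an input of size height x width."""
    return height // 2 ** level, width // 2 ** level


def upsample2x(fmap):
    """Nearest-neighbour 2x up-sampling of a [C][H][W] nested list (assumed scheme)."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [value for value in row for _ in range(2)]
            rows.append(wide)
            rows.append(list(wide))
        out.append(rows)
    return out


def conv1x1(fmap, weights):
    """1x1 convolution: weights[o][i] mixes input channel i into output channel o."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[[sum(w * fmap[i][y][x] for i, w in enumerate(w_row))
              for x in range(width)]
             for y in range(height)]
            for w_row in weights]


def top_down_merge(coarser_level, bottom_up, lateral_weights):
    """Up-sample the coarser level and add the 1x1-convolved bottom-up features."""
    upsampled = upsample2x(coarser_level)
    lateral = conv1x1(bottom_up, lateral_weights)
    return [[[u + v for u, v in zip(u_row, v_row)]
             for u_row, v_row in zip(u_ch, v_ch)]
            for u_ch, v_ch in zip(upsampled, lateral)]


# Toy example: a 1-channel 1x1 coarse level merged with 2-channel 2x2 bottom-up features.
coarse = [[[1.0]]]
bottom_up = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]
merged = top_down_merge(coarse, bottom_up, lateral_weights=[[1.0, 0.5]])
```

For a 1024×2048 input, `level_resolution` gives 32×64 at level 5 and 256×512 at level 2.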
3.2 Forecasting Convolutional Features
Given a video sequence, our goal is to predict instance-level object segmentations for one or more future frames, i.e. for frames where we cannot access the RGB pixel values. Similar to previous work that predicts future RGB frames [mathieu16iclr, RanzatoSzlamBruna2014, srivastava15icml, kalchbrenner17icml] and future semantic segmentations [luc17iccv], we are interested in models where the input and output of the predictive model live in the same space, so that the model can be applied recursively to produce predictions for more than one frame ahead. The instance segmentations themselves, however, do not provide a suitable representation for prediction, since the instance-level labels vary in number between frames, and are not consistent across video sequences. To overcome this issue, we instead resort to predicting the highest level features in the Mask R-CNN architecture that are of fixed size. In particular, using the FPN backbone in Mask R-CNN, we want to learn a model that, given the feature pyramids extracted from frames $t-\tau$ to $t$, predicts the feature pyramid for the unobserved RGB frame $t+1$.
Architecture. The features at the different FPN levels are trained to be input to a shared detection head, and are thus of similar nature. However, since the resolution changes across levels, the spatio-temporal dynamics are distinct from one level to another. Therefore, we propose a multi-scale approach, employing a separate network to predict the features at each level. The per-level networks are trained and function completely independently from each other. For each level, we concatenate the features of the input sequence along the feature dimension. We refer to the “feature to feature” predictive model for level $l$ as F2F$_l$. The overall architecture is summarized in the right panel of Figure 2.
Each of the F2F$_l$ networks is implemented by a resolution-preserving CNN, i.e. where input and output have the same resolution. Each network is itself multi-scale as in [mathieu16iclr, luc17iccv], to efficiently enlarge the field of view while preserving high-resolution details. More precisely, for a given level $l$, F2F$_l$ consists itself of $s_l$ subnetworks F2F$_l^s$, where $s \in \{1, \dots, s_l\}$. The network F2F$_l^{s_l}$ first processes the input downsampled by a factor $2^{s_l-1}$. Its output is up-sampled by a factor two, and concatenated to the input downsampled by a factor $2^{s_l-2}$. This concatenation constitutes the input of F2F$_l^{s_l-1}$, which predicts a refinement of the initial coarse prediction. The same procedure is repeated until the final scale subnetwork F2F$_l^1$.
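The coarse-to-fine procedure can be sketched as follows, under the assumption that the subnetwork at scale s sees the input downsampled by a factor 2^(s-1). The feature maps are nested lists laid out as [channel][row][column], the input frames are assumed to be already concatenated along the channel dimension, and `mean_over_channels` is a hypothetical stand-in for a learned subnetwork; average pooling and nearest-neighbour up-sampling are likewise illustrative choices.

```python
def downsample(fmap, factor):
    """Average-pool a [C][H][W] nested list by an integer factor (assumed scheme)."""
    pooled = []
    for channel in fmap:
        height, width = len(channel) // factor, len(channel[0]) // factor
        pooled.append([[sum(channel[y * factor + dy][x * factor + dx]
                            for dy in range(factor)
                            for dx in range(factor)) / factor ** 2
                        for x in range(width)]
                       for y in range(height)])
    return pooled


def upsample2x(fmap):
    """Nearest-neighbour 2x up-sampling of a [C][H][W] nested list."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [value for value in row for _ in range(2)]
            rows.append(wide)
            rows.append(list(wide))
        out.append(rows)
    return out


def multiscale_predict(inputs, subnets):
    """Coarse-to-fine prediction; subnets[0] is the finest scale, subnets[-1] the coarsest."""
    prediction = None
    for scale in range(len(subnets), 0, -1):
        x = downsample(inputs, 2 ** (scale - 1))
        if prediction is not None:
            # Channel-wise concatenation with the up-sampled coarser prediction.
            x = x + upsample2x(prediction)
        prediction = subnets[scale - 1](x)
    return prediction


def mean_over_channels(fmap):
    """Toy stand-in for a learned subnetwork: averages all channels into one."""
    n = len(fmap)
    return [[[sum(ch[y][x] for ch in fmap) / n for x in range(len(fmap[0][0]))]
             for y in range(len(fmap[0]))]]


# Four input frames with one 4x4 channel each, concatenated along the feature dimension.
frames = [[[3.0] * 4 for _ in range(4)] for _ in range(4)]
output = multiscale_predict(frames, [mean_over_channels] * 3)
```

The output has the same spatial size as the input, which is the resolution-preserving property the text describes.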
The design of the F2F subnetworks is inspired by that of [luc17iccv], leveraging dilated convolutions to further enlarge the field of view. Our architecture differs in the number of feature maps per layer, the convolution kernel sizes and dilation parameters, to make it more suited for the larger input dimension. We detail these design choices in the supplementary material.
Training. We compute the coarsest $P_5$ feature level off-line, and train the F2F$_5$ model efficiently from these pre-computed features. Due to memory constraints, we cannot pre-compute and store the features at the higher resolution $P_4$, $P_3$ and $P_2$ levels. However, since the features of the different FPN levels are fed to the same recognition head network, these features are similar to the $P_5$ ones. Hence, we initialize the weights of F2F$_4$, F2F$_3$, and F2F$_2$ with the ones learned for F2F$_5$, and fine-tune them using features computed on the fly. Each of the F2F$_l$ networks is trained using an $\ell_2$ loss on the predicted feature values.
For multiple time step prediction, we can fine-tune each subnetwork F2F$_l$ autoregressively using back propagation through time, similar to [luc17iccv]. Unlike the typical “teacher forced” training of recurrent networks, this approach takes into account error accumulation over time. This is possible in our scenario since we predict in a continuous space, rather than in a discrete space as is commonly the case for recurrent networks, e.g. for language modeling. In this case, given a single sequence of input feature maps, we train with a separate loss on all the future frames for which we predict. In our experiments, all models are trained in this autoregressive manner, unless specified otherwise.
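The forward pass of such an autoregressive rollout can be sketched as below. This is only an illustration: feature maps are flattened to lists of floats, the squared-error loss and the `linear_extrapolation` predictor are hypothetical placeholders, and the backward pass (back propagation through time) is not shown.

```python
def rollout(predict_next, input_frames, num_steps):
    """Apply a one-step predictor recursively, feeding predictions back as inputs."""
    window = list(input_frames)
    predictions = []
    for _ in range(num_steps):
        next_frame = predict_next(window)
        predictions.append(next_frame)
        window = window[1:] + [next_frame]  # slide the input window forward
    return predictions


def squared_error(prediction, target):
    """Mean squared error between two flattened feature maps (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def rollout_loss(predict_next, input_frames, target_frames):
    """Sum of separate per-frame losses over every predicted future frame."""
    predictions = rollout(predict_next, input_frames, len(target_frames))
    return sum(squared_error(p, t) for p, t in zip(predictions, target_frames))


def linear_extrapolation(window):
    """Toy predictor: continues the trend of the last two frames."""
    return [2 * last - prev for prev, last in zip(window[-2], window[-1])]


inputs = [[0.0], [1.0], [2.0], [3.0]]
future = rollout(linear_extrapolation, inputs, 3)
loss = rollout_loss(linear_extrapolation, inputs, [[4.0], [5.0], [7.0]])
```

Because later predictions are computed from earlier predictions rather than from ground truth, the summed loss reflects error accumulation, in contrast with teacher forcing.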
4 Experimental Evaluation
In this section we first present our experimental setup and baseline models, and then proceed with quantitative and qualitative results that demonstrate the positive impact of our F2F approach.
4.1 Experimental setup: Dataset and evaluation metrics
Dataset. In our experiments, we use the Cityscapes dataset [cordts16cvpr] which contains 2,975 train, 500 validation and 1,525 test video sequences of 1.8 seconds each, recorded from a car driving in urban environments. Each sequence consists of 30 frames of resolution 1024×2048, and complete ground-truth semantic and instance segmentations for every pixel are available for the 20th frame of each sequence.
We employ a Mask R-CNN model pre-trained on the MS-COCO dataset [lin14eccv] and fine-tune it in an end-to-end fashion on the Cityscapes dataset. The coarsest FPN level P5 has resolution 32×64, and the finest level P2 has resolution 256×512.
Following [luc17iccv], we train our models using a frame interval of three, and taking four frames as input. That is, the input sequence consists of feature pyramids for frames $\{t-9, t-6, t-3, t\}$. We denote predicting $t+3$ as short-term and use mid-term prediction to denote predicting up to $t+9$, corresponding to predicting up to 0.17 sec. and 0.5 sec. respectively.
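To make the frame indexing concrete, the sketch below lists which frames serve as input and which are predicted for a given number of prediction steps. It assumes 1-based frame numbering and that the last predicted frame is the annotated 20th frame; the function name is hypothetical.

```python
FRAME_INTERVAL = 3   # frames between consecutive inputs (from the text)
NUM_INPUTS = 4       # number of input frames (from the text)


def frames_for_target(target_frame, num_steps):
    """Return (input frame indices, predicted frame indices) ending at target_frame."""
    last_input = target_frame - FRAME_INTERVAL * num_steps
    inputs = [last_input - FRAME_INTERVAL * k for k in range(NUM_INPUTS - 1, -1, -1)]
    predicted = [last_input + FRAME_INTERVAL * k for k in range(1, num_steps + 1)]
    return inputs, predicted


short_term = frames_for_target(20, num_steps=1)   # one step of three frames
mid_term = frames_for_target(20, num_steps=3)     # three steps, nine frames ahead
```

Under these assumptions, short-term prediction uses frames 8, 11, 14 and 17 to predict frame 20, while mid-term prediction uses frames 2, 5, 8 and 11 and predicts frames 14, 17 and 20 in turn.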
Conversion to semantic segmentation. For direct comparison to previous work, we also convert our instance segmentation predictions to semantic segmentation. To this end, we first assign all pixels in the semantic segmentation the background label. Then, we iterate over the detected object instances in order of ascending confidence score. For each instance, consisting of a confidence score $c$, a class $k$, and a binary mask $m$, we reject it if $c < \theta$ and accept it otherwise, where $\theta$ is a confidence threshold that is fixed in our experiments. For accepted instances, we update the semantic segmentation at spatial positions corresponding to mask $m$ with label $k$. This step potentially replaces labels set by an instance with lower confidence, and resolves competing class predictions.
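A minimal sketch of this conversion is given below. The dictionary layout of an instance, the function name, and the threshold value used in the example are assumptions chosen for illustration only.

```python
def instances_to_semantic(instances, height, width, threshold, background=0):
    """Paint instance masks into a label map, lowest confidence first."""
    semantic = [[background] * width for _ in range(height)]
    for instance in sorted(instances, key=lambda inst: inst["score"]):
        if instance["score"] < threshold:
            continue  # reject low-confidence detections
        for y, row in enumerate(instance["mask"]):
            for x, inside in enumerate(row):
                if inside:
                    # Later (more confident) instances overwrite earlier labels.
                    semantic[y][x] = instance["label"]
    return semantic


# Hypothetical detections on a 2x3 image; the threshold 0.5 is illustrative.
instances = [
    {"score": 0.9, "label": 1, "mask": [[1, 1, 0], [0, 0, 0]]},
    {"score": 0.6, "label": 2, "mask": [[0, 1, 1], [0, 0, 0]]},
    {"score": 0.2, "label": 3, "mask": [[0, 0, 0], [1, 1, 1]]},
]
semantic = instances_to_semantic(instances, height=2, width=3, threshold=0.5)
```

In this toy case the pixel claimed by both accepted instances receives the label of the more confident one, and the rejected instance leaves the bottom row as background.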
Evaluation metrics. To measure the instance segmentation performance, we use the standard Cityscapes metrics. The average precision metric AP50 counts an instance as correct if it has at least 50% of intersection-over-union (IoU) with the ground truth instance it has been matched with. The summary AP metric is given by average AP obtained with ten equally spaced IoU thresholds from 50% to 95%. Performance is measured across the eight classes with available instance-level ground truth: person, rider, car, truck, bus, train, motorcycle, and bicycle.
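The IoU criterion and the threshold grid behind these metrics can be illustrated as follows. This sketch covers only the per-instance IoU test; matching predictions to ground truth and integrating precision over recall are not shown, and the function names are hypothetical.

```python
def mask_iou(pred_mask, gt_mask):
    """Intersection-over-union of two binary masks given as nested lists."""
    intersection = union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return intersection / union if union else 0.0


# Ten equally spaced IoU thresholds from 50% to 95%.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou):
    """IoU thresholds at which a matched instance would count as correct."""
    return [thr for thr in IOU_THRESHOLDS if iou >= thr]


iou = mask_iou([[1, 1, 1, 0]], [[0, 1, 1, 1]])
```

Here the two masks share two of four covered pixels, so the instance counts as correct for AP50 but for none of the stricter thresholds.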
We measure semantic segmentation performance across the same eight classes as instance segmentation. In addition to the IoU metric, computed w.r.t. the ground truth segmentation of the 20th frame in each sequence, we also quantify the segmentation accuracy using three standard segmentation measures used in [yang08cviu], namely the Probabilistic Rand Index (PRI) [pantofaru05tr], Global Consistency Error (GCE) [martin01iccv], and Variation of Information (VoI) [meila05icml]. Good segmentation results are associated with high PRI, low GCE and low VoI.
Implementation details. We cross-validate the number of scales, the optimization algorithm, and the parameters per level of the pyramid. This leads to the choice of a single-scale network for each level of the pyramid, except for F2F, where we employ three scales. The F2F$_5$ network is trained for 60K iterations of SGD with Nesterov momentum, with learning rate , and batch size of four images. It is used to initialize the other subnetworks, which are trained for 80K iterations of SGD with Nesterov momentum with batch size of one image, and learning rates of for F2F and for F2F. For F2F, which is much deeper, we used Adam with learning rate and default parameters.
4.2 Baseline models
As a performance upper bound, we report the accuracy of a Mask R-CNN oracle that has access to the future RGB. We also use a trivial copy baseline that returns the segmentation of the last input RGB frame.
Optical flow baselines. We designed two baselines based on the optical flow field computed from RGB frame $t$ to $t-3$, and the instance segmentation predicted at frame $t$. The Warp approach consists in warping each instance mask independently using the flow field inside this mask. We initialize a separate flow field for each instance, equal to the flow field inside the instance mask and zero elsewhere. For a given instance, the corresponding flow field is used to project the values of the instance mask in the opposite direction of the flow vectors, yielding a new binary mask. To this predicted mask, we associate the class and confidence score of the input instance it was obtained from. To predict more than one time-step ahead, we also update the instance’s flow field in the same fashion, to take into account the previously predicted displacement of physical points composing the instance. The predicted mask and flow field are used to make the next prediction, and so on. Maintaining separate flow fields allows competing flow values to coexist for the same spatial position, when they belong to different instances whose predicted trajectories lead them to overlap. To smooth the results of this baseline, we perform post-processing operations at each time step, which significantly improve the results and which we detail in the supplementary material.
Warping the flow field when predicting multiple steps ahead suffers from error accumulation. To avoid this, we test another baseline, Shift, which shifts each mask with the average flow vector computed across the mask. To predict $T$ time steps ahead, we simply shift the instance $T$ times. This approach, however, is unable to scale the objects, and is therefore unsuitable for long-term prediction when objects significantly change in scale as their distance to the camera changes.
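The Shift baseline can be sketched as below. The sketch assumes that the flow is given per pixel as a forward displacement `(dx, dy)` per time step (a flow field pointing to the past would need to be negated first) and that the mean vector is rounded to whole pixels; both choices, like the function names, are assumptions made for illustration.

```python
def average_flow(mask, flow):
    """Mean (dx, dy) flow vector over the pixels of a binary mask."""
    vectors = [flow[y][x]
               for y, row in enumerate(mask)
               for x, on in enumerate(row) if on]
    if not vectors:
        return 0.0, 0.0
    count = len(vectors)
    return sum(v[0] for v in vectors) / count, sum(v[1] for v in vectors) / count


def shift_mask(mask, dx, dy):
    """Translate a binary mask by integer offsets, dropping out-of-image pixels."""
    height, width = len(mask), len(mask[0])
    shifted = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if mask[y][x] and 0 <= ny < height and 0 <= nx < width:
                shifted[ny][nx] = 1
    return shifted


def shift_baseline(mask, flow, num_steps):
    """Shift the mask num_steps times by its (rounded) average flow vector."""
    mean_dx, mean_dy = average_flow(mask, flow)
    dx, dy = round(mean_dx), round(mean_dy)
    for _ in range(num_steps):
        mask = shift_mask(mask, dx, dy)
    return mask


mask = [[0, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 0]]
flow = [[(1.0, 0.0)] * 5 for _ in range(3)]
predicted = shift_baseline(mask, flow, num_steps=2)
```

Since the mask is only translated, its area never changes, which illustrates why this baseline cannot account for objects changing in scale.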
Future semantic segmentation using discrete label maps. For comparison with the future semantic segmentation approach of [luc17iccv], which ignores instance-level labels, we train their S2S model on the label maps produced by Mask R-CNN. Following their approach, we down-sample the Mask R-CNN label maps to . Unlike the soft-label maps from the Dilated-10 network [yu16iclr] used in [luc17iccv], our converted Mask R-CNN label maps are discrete. For autoregressive prediction, we discretize the output by replacing the softmax network output with a one-hot encoding of the most likely class at each pixel. For autoregressive fine-tuning, we use a softmax activation with a low temperature parameter at the output of the S2
|Method|Short term AP50|Short term AP|Mid term AP50|Mid term AP|
|---|---|---|---|---|
|Mask R-CNN oracle|65.8|37.3|65.8|37.3|
|Copy last segmentation|24.1|10.1|6.6|1.8|
|Optical flow – Shift|37.0|16.0|9.7|2.9|
|Optical flow – Warp|36.8|16.5|11.1|4.1|
|F2F w/o ar. fine tuning|40.2|19.0|17.5|6.2|
4.3 Quantitative results
Future instance segmentation. In Table 1 we present instance segmentation results of our future feature prediction approach (F2F) and compare it to the performance of the oracle, copy, and optical flow baselines. From the results we draw several conclusions. First, the copy baseline performs very poorly (24.1% in terms of AP50 vs. 65.8% for the oracle), which underlines the difficulty of the task. The optical flow baselines are much better. While Shift and Warp perform comparably for short-term prediction (37.0% vs. 36.8% AP50 respectively), the Warp approach performs better for mid-term prediction (11.1% vs. 9.7% AP50 for Shift). Our F2F approach gives the best results overall, reaching more than 74% relative improvement in AP50 over our best mid-term baseline.
For our F2F model, autoregressive fine-tuning makes little difference for short-term prediction (39.9% with vs. 40.2% without, in AP50), but it gives a significant improvement for mid-term prediction (19.4% vs. 17.5% AP50). For short-term prediction, our F2F improves over the Warp baseline by 3.1 points of AP50, from 36.8% to 39.9%. For mid-term prediction the difference is more pronounced: our F2F improves over the Warp baseline by 8.3 points of AP50, from 11.1% to 19.4%.
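The relation between the absolute and relative improvements quoted above can be checked with a few lines of arithmetic, using only the AP50 values reported in the text; the function name is illustrative.

```python
def relative_improvement(baseline, ours):
    """Relative gain of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Mid-term AP50: Warp baseline vs. F2F with autoregressive fine-tuning.
mid_term_absolute = 19.4 - 11.1
mid_term_relative = relative_improvement(baseline=11.1, ours=19.4)

# Short-term AP50: Warp baseline vs. F2F with autoregressive fine-tuning.
short_term_absolute = 39.9 - 36.8
```

The mid-term relative gain evaluates to roughly 74.8%, consistent with the "more than 74%" figure, while the absolute gains are about 8.3 and 3.1 points respectively.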