Resources for semantic segmentation based on deep learning models
The ability to predict and therefore to anticipate the future is an important attribute of intelligence. It is also of utmost importance in real-time systems, e.g. in robotics or autonomous driving, which depend on visual scene understanding for decision making. While prediction of the raw RGB pixel values in future video frames has been studied in previous work, here we introduce the novel task of predicting semantic segmentations of future frames. Given a sequence of video frames, our goal is to predict segmentation maps of not yet observed video frames that lie up to a second or further in the future. We develop an autoregressive convolutional neural network that learns to iteratively generate multiple frames. Our results on the Cityscapes dataset show that directly predicting future segmentations is substantially better than predicting and then segmenting future RGB frames. Prediction results up to half a second in the future are visually convincing and are much more accurate than those of a baseline based on warping semantic segmentations using optical flow.
Predicting Deeper into the Future of Semantic Segmentation
Prediction and anticipation of future events is a key component of intelligent decision-making. Building smarter robotic systems and autonomous vehicles implies making decisions based on the analysis of the current situation and hypotheses made on what could happen next. While humans can predict vehicle or pedestrian trajectories effortlessly and at the reflex level, it remains an open challenge for current computer vision systems. Besides the long term goal of learning a good representation allowing machines to reason about future events, an application which directly benefits from our work is autonomous driving. In this domain, approaches are either based on a number of semantic decompositions such as road and obstacle detection, or directly learn a mapping from visual input to driving instructions end-to-end. Recent work from Mobileye demonstrated an advantage of the semantic abstraction approach in lowering the required amount of training data and decreasing the probability of failure. Other work uses future prediction to facilitate long-term planning problems and forms a direct motivation for our work.
The task of predicting future RGB video frames given preceding ones is interesting to assess if current vision systems are able to reason about future events, and it has recently received significant attention [19, 28, 32, 35]. Modeling raw RGB intensities is, however, overly complicated as compared to predicting future high-level scene properties, while the latter is sufficient for many applications. Such high-level future prediction has been studied in various forms, e.g. by explicitly forecasting trajectories of people and other objects in future video frames [1, 12, 15, 21, 23, 29]. In our work we do not explicitly model objects or other scene elements, but instead model the dynamics of semantic segmentation maps of object categories with convolutional neural networks. Semantic segmentation is one of the most complete forms of visual scene understanding, where the goal is to label each pixel with the corresponding semantic label (e.g., tree, pedestrian, car, etc.). In our work, we build upon the recent progress in this area [10, 25, 5, 46, 31, 30, 24], and develop models to predict the semantic segmentation of future video frames, given several preceding frames. See Figure 1 for an illustration.
The pixel-level annotations needed for semantic segmentation are expensive to acquire, and this is even worse if we need annotations for each video frame. To alleviate this issue we rely on state-of-the-art semantic image segmentation models to label all frames in videos, and then learn our future segmentation prediction models from these automatically generated annotations.
We systematically study the effect of using RGB frames and/or segmentations as inputs and targets for our models, and the impact of various loss functions. Our experiments on the Cityscapes dataset suggest that it is advantageous to directly predict future frames at the abstract semantic-level, rather than to predict the low-level RGB appearance of future frames and then to apply a semantic segmentation model on these. By moving away from raw RGB predictions and modeling pixel-level object labels instead, the network’s modeling capacity seems better allocated to learn basic physics and object interaction dynamics.
In this work we make two contributions:
We introduce the novel task of predicting future frames in the space of semantic segmentation. Compared with prediction of the RGB intensities, we show that we can predict further into the future, and hence model more interesting distributions.
Our approach does not require extremely costly temporally dense video annotation and its genericity allows different architectures for still-image segmentation and future segmentation prediction to be swapped in.
Here we discuss the most relevant related work on video forecasting and on disambiguating learning under uncertainty, in particular using adversarial training.
Video forecasting. Several authors developed methods related to our work to improve the temporal stability of semantic video segmentation. Jin et al. train a model to predict the semantic segmentation of the immediate next image from the preceding input frames, and fuse this prediction with the segmentation computed from the next input frame. Nilsson and Sminchisescu use a convolutional RNN model with a spatial transformer component to accumulate the information from past and future frames in order to improve prediction of the current frame segmentation. In a similar spirit, Patraucean et al. employ a convolutional RNN to implicitly predict the optical flow, and use it to warp and aggregate per-frame segmentations. In contrast, our work is focused on predicting future segmentations without seeing the corresponding frames. Most importantly, we target a longer time horizon than a single frame.
developed a Long Short Term Memory (LSTM) architecture for the task, and demonstrated a gain in action classification using the learned features. Mathieu et al. improved the predictions using a multi-scale convolutional architecture, adversarial training, and a gradient difference loss. A similar training strategy was employed for future frame predictions in time-lapse videos. To reduce the number of parameters to estimate, several authors reparameterize the problem to predict frame transformations instead of raw pixels [11, 37]. Luo et al. employ a convolutional LSTM architecture to predict sequences of up to eight frames of optical flow in RGB-D videos. The video pixel network of Kalchbrenner et al. combines LSTMs for temporal modeling with spatial autoregressive modeling. Rather than predicting pixels or flows, Vondrick et al. instead predict features in future frames. They predict the activations of the last hidden layer of AlexNet in future frames, and use these to anticipate objects and actions.
Learning under uncertainty. Generative adversarial networks (GANs) and variational autoencoders (VAEs) are deep latent variable models that can be used to deal with the inherent uncertainty in future-prediction tasks. An interesting approach using GANs for unsupervised image representation learning was proposed simultaneously in two works, where the generative model is trained along with an inference model that maps images to their latent representations. Vondrick et al. showed that GANs can be applied to video generation. They use a two-stream generative model: one stream generates a static background, while the other generates a dynamic foreground sequence which is pasted on the background. Yang et al. use similar ideas to develop an iterative image generation model where objects are sequentially pasted on the image canvas using a recurrent GAN. Xue et al. predict future video frames from a single given frame using a VAE approach. Similarly, Walker et al. perform forecasting with a VAE, predicting feature point trajectories from still images.
We start by presenting different scenarios to predict RGB pixel values and/or segmentations of the next video frame. In Section 3.2 we describe two extensions of the single-frame prediction model to predict further into the future.
Pixel-level supervision is laborious to acquire for semantic image segmentation, and even more so for its video counterpart. To circumvent the need for datasets with per-frame annotations, we use the state-of-the-art Dilation10 semantic image segmentation network to provide input and target semantic segmentations for all frames in each video. We use the resulting temporally dense segmentation sequences to learn our models.
Let us denote with $X_i$ the $i$-th frame of a video sequence, and denote the sequence of frames from $X_t$ to $X_T$ as $X_{t:T}$. We denote by $S_i$ the semantic segmentation of frame $X_i$ given by the Dilation10 network. We represent the segmentations $S_i$ using the final softmax layer’s pre-activations, rather than the probabilities it produces. This is motivated by recent observations in network distillation that the softmax pre-activations carry more information [3, 14]. For single-frame future prediction, we consider five different models that differ in whether they take RGB frames and/or segmentations as their inputs and targets: model X2X takes $X_{1:t}$ and predicts $X_{t+1}$, model S2S takes $S_{1:t}$ and predicts $S_{t+1}$, models XS2X and XS2S take $(X_{1:t}, S_{1:t})$ and predict respectively $X_{t+1}$ and $S_{t+1}$, and finally model XS2XS takes $(X_{1:t}, S_{1:t})$ and predicts $(X_{t+1}, S_{t+1})$.
Architectures. Model X2X is a next frame prediction model, for which we use the multi-scale network of Mathieu et al. with two spatial scales. Denoting by $C$ the number of output channels, each scale module is a four-layer convolutional network alternating convolutions and ReLU operations, outputting feature maps with 128, 256, 128, and $C$ channels, with filters of size 3 for the smaller scale, and 5, 3, 3, 5 for the larger scale. The last non-linear function is a hyperbolic tangent, to ensure that the predicted RGB values lie in the range $[-1, 1]$. The output at a coarser scale is upsampled, and used as input to the next scale module together with a copy of the input at that scale.
For models that predict segmentations, we removed the last hyperbolic tangent non-linearities for the corresponding output channels, since the softmax pre-activations are not limited to a fixed range. Apart from this difference, the S2S model, which predicts the next segmentation from past ones, has the same architecture as the X2X model.
The multi-scale architecture of the S2S model is illustrated in Figure 2. The other models (XS2X, XS2S, and XS2XS), which take both RGB frames and segmentation maps as input, also use the same internal architecture.
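Loss function. Following Mathieu et al., for all models, the loss function between the model output $\hat{Y}$ and the target output $Y$ is the sum of an $\ell_1$ loss and a gradient difference loss (gdl):

$$\mathcal{L}(\hat{Y}, Y) = \mathcal{L}_{\ell_1}(\hat{Y}, Y) + \mathcal{L}_{\mathrm{gdl}}(\hat{Y}, Y).$$

Using $Y_{ij}$ to denote the pixel elements in $Y$, and similarly for $\hat{Y}$, the losses are defined as:

$$\mathcal{L}_{\ell_1}(\hat{Y}, Y) = \sum_{i,j} \left| Y_{ij} - \hat{Y}_{ij} \right|,$$

$$\mathcal{L}_{\mathrm{gdl}}(\hat{Y}, Y) = \sum_{i,j} \Big| \left| Y_{i,j} - Y_{i-1,j} \right| - \left| \hat{Y}_{i,j} - \hat{Y}_{i-1,j} \right| \Big| + \Big| \left| Y_{i,j-1} - Y_{i,j} \right| - \left| \hat{Y}_{i,j-1} - \hat{Y}_{i,j} \right| \Big|,$$

where $|\cdot|$ denotes the absolute value function. The $\ell_1$ loss tries to match all pixel predictions independently to their corresponding target values. The gradient difference loss, instead, penalizes errors in the gradients of the prediction. This loss is relatively insensitive to low-frequency mismatches between prediction and target (e.g., adding a constant to all pixels does not affect the loss), and is more sensitive to high-frequency mismatches that are perceptually more significant (e.g. errors along the contours of an object). We present a comparison of this loss with a multiclass cross entropy loss in Section 4.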
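The two loss terms described above, an absolute per-pixel difference and a gradient difference term computed on neighbouring pixels, can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' Torch implementation: the function names are hypothetical, arrays are assumed to have height and width as their first two axes, and no per-pixel normalisation or weighting between the terms is applied.

```python
import numpy as np

def l1_loss(pred, target):
    """Sum of absolute per-pixel differences between prediction and target."""
    return np.abs(target - pred).sum()

def gdl_loss(pred, target):
    """Gradient difference loss: compares the absolute finite differences
    of prediction and target along the vertical and horizontal axes."""
    # Vertical neighbours: |Y[i, j] - Y[i-1, j]|
    dy_t = np.abs(target[1:, :] - target[:-1, :])
    dy_p = np.abs(pred[1:, :] - pred[:-1, :])
    # Horizontal neighbours: |Y[i, j-1] - Y[i, j]|
    dx_t = np.abs(target[:, :-1] - target[:, 1:])
    dx_p = np.abs(pred[:, :-1] - pred[:, 1:])
    return np.abs(dy_t - dy_p).sum() + np.abs(dx_t - dx_p).sum()

def total_loss(pred, target):
    """Unweighted sum of the two terms (an assumption of this sketch)."""
    return l1_loss(pred, target) + gdl_loss(pred, target)

# Adding a constant to every pixel changes the absolute-difference term
# but leaves the gradient difference term at zero.
target = np.array([[0.0, 1.0], [2.0, 3.0]])
shifted = target + 5.0
print(l1_loss(shifted, target), gdl_loss(shifted, target))
```

The toy example reproduces the property stated in the text: a constant offset is a low-frequency mismatch, so only the per-pixel term reacts to it.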
Adversarial training. As shown by Mathieu et al. in the context of raw images, introducing an adversarial loss allows the model to disambiguate between modes corresponding to different turns of events, and reduces blur associated with this uncertainty. Luc et al. demonstrated the positive influence of adversarial training for semantic image segmentation.
Our formulation of the adversarial loss term is based on the recently introduced Wasserstein GAN, with some modifications for the semantic segmentation application. In the case of the S2S model, the parameters of the discriminator are trained to maximize the absolute difference between its output for ground truth sequences and sequences predicted by our model:
The outputs produced by the predictive model are softmax pre-activation maps with unbounded values. In the Wasserstein GAN they are encouraged to grow indefinitely. To avoid this and stabilize training, we employ an additional sigmoid non-linearity at the output of the discriminator, and set explicit targets for two kinds of outputs: for generated sequences and for real training sequences, set to to prevent saturation.
The adversarial regularization term for our predictive model (i.e. the “generator”) then takes the following form:
The structure of the discriminator network is derived from the two-scale architecture described above. Additional details are provided in the supplementary material.
We consider two extensions of the previous models to predict further into the future than a single frame. The first is to expand the output of the network to comprise a batch of $m$ frames, i.e. to output $X_{t+1:t+m}$ and/or $S_{t+1:t+m}$. We refer to this as the “batch” approach. The drawback of this approach is that it ignores the recurrent structure of the problem. That is, it ignores the fact that $S_{t+2}$ depends on $S_{2:t+1}$ in the same manner as $S_{t+1}$ depends on $S_{1:t}$. As a result, the capacity of the model is split to predict the $m$ output frames, and the number of parameters in the last layer scales linearly with the number of output frames.
In our second approach, we leverage the recurrence property, and iteratively apply a model that predicts a single step into the future, using its prediction for time $t+1$ as an input to predict at time $t+2$, and so on. This allows us to predict arbitrarily far into the future in an autoregressive manner, without resources scaling with the number of time-steps we want to predict. We refer to this approach as “autoregressive”. See Figure 3 for a schematic illustration of the two extensions for multiple time-step predictions.
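The autoregressive scheme can be sketched as a sliding-window loop: each predicted frame is appended to the context and the oldest frame is dropped. The sketch below is illustrative only. `autoregressive_rollout` and `toy_step_model` are hypothetical names, and the toy model (a linear extrapolation from the last two frames) merely stands in for a trained single-step network such as S2S.

```python
import numpy as np

def autoregressive_rollout(step_model, inputs, n_steps):
    """Apply a single-step predictor repeatedly.

    inputs: list of arrays, oldest first.
    step_model: callable mapping a list of context frames to one next frame.
    Returns the list of n_steps predicted frames.
    """
    context = list(inputs)
    outputs = []
    for _ in range(n_steps):
        nxt = step_model(context)
        outputs.append(nxt)
        # Slide the window: drop the oldest frame, feed back the prediction.
        context = context[1:] + [nxt]
    return outputs

def toy_step_model(context):
    """Hypothetical stand-in for a trained network: linear extrapolation."""
    return 2.0 * context[-1] - context[-2]

# Four input "frames" with constant values 0, 1, 2, 3.
frames = [np.full((2, 2), float(v)) for v in (0, 1, 2, 3)]
preds = autoregressive_rollout(toy_step_model, frames, n_steps=3)
print([float(p[0, 0]) for p in preds])
```

The size of the model does not depend on `n_steps`, whereas a batch model would need an output layer that grows with the number of predicted frames.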
Before presenting our experimental results, we first describe the dataset and evaluation metrics in Section 4.1. We then present results on short-term (i.e. single-frame) prediction, mid-term prediction (0.5 sec.), and long-term prediction (10 sec.).
The Cityscapes dataset contains 2,975 training, 500 validation and 1,525 testing video sequences of 1.8 seconds each. Each sequence consists of 30 frames, and a ground-truth semantic segmentation is available for the 20-th frame. The segmentation outputs of the Dilation10 network are produced at a lower resolution than the original frames, and we perform all experiments at this resolution. For this purpose, we also downsample RGB frames and ground truth to this resolution. We report performance of our models on the Cityscapes validation set, and refer to the supplementary material for results on the test set.
We assess performance using the standard mean Intersection over Union (IoU) measure, computed with respect to the ground truth segmentation of the 20-th frame in each sequence (IoU GT). We also compute the IoU measure with respect to the segmentation produced using the Dilation10 network for the 20-th frame (IoU SEG). The IoU SEG metric allows us to validate our models with respect to the target segmentations from which they are trained. Finally, we compute the mean IoU across categories that can move in the scene: person, rider, car, truck, bus, train, motorcycle, and bicycle (IoU-MO, for “moving objects”).
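As a rough illustration of the metric, the sketch below computes a mean IoU between two label maps and optionally restricts it to a subset of classes, as IoU-MO does for moving objects. It is a simplified per-image version written for this explanation. Benchmark evaluations typically accumulate intersections and unions over the whole dataset before averaging, and the `ignore_label` value of 255 is an assumption, not something stated in the text.

```python
import numpy as np

def mean_iou(pred, target, num_classes, classes=None, ignore_label=255):
    """Mean Intersection over Union between two integer label maps.

    classes: optional iterable of class ids to average over
             (e.g. only the moving-object classes).
    Pixels whose target equals ignore_label are excluded.
    """
    valid = target != ignore_label
    class_ids = range(num_classes) if classes is None else classes
    ious = []
    for c in class_ids:
        p = (pred == c) & valid
        t = (target == c) & valid
        union = (p | t).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        ious.append((p & t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
print(mean_iou(pred, target, num_classes=2))          # average over both classes
print(mean_iou(pred, target, num_classes=2, classes=[1]))  # class 1 only
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the two calls average over different sets of classes in the same way IoU and IoU-MO do.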
To evaluate the quality of the RGB frame predictions, we compute the Peak Signal to Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM). The SSIM measures similarity between two images, ranging from $-1$ for very dissimilar inputs to $+1$ when the inputs are the same. It is based on comparing local patterns of pixel intensities normalized for luminance and contrast.
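For reference, PSNR follows directly from the mean squared error between the predicted and the true frame. The sketch below uses the standard definition; the `max_val` parameter (the peak intensity, here assumed to be 1.0 for images scaled to the unit interval) is an assumption about the data range and not a detail taken from the text.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal to Noise Ratio in decibels.

    max_val is the largest possible pixel value (1.0 for images in [0, 1],
    255 for 8-bit images).
    """
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)   # uniform error of 0.1, so MSE is 0.01
print(psnr(pred, target))     # approximately 20 dB
```

Higher values indicate a closer match; unlike SSIM, PSNR treats every pixel independently and ignores local structure.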
Unless specified otherwise, we train our models using a frame interval of 3, and taking 4 frames and/or segmentations as input. That is, the input sequence consists of frames $\{X_{t-9}, X_{t-6}, X_{t-3}, X_t\}$, and similarly for segmentations. We performed patch-wise training at the largest scale resolution, enabling equal class frequency sampling, using mini-batches of four patches and a learning rate of 0.01.
In our first experiment, we compare the five different input-output representations. For models that do not directly predict future segmentations, we generate segmentations using the Dilation10 network based on the predicted RGB frames. We also include two baselines. The first baseline copies the last input frame to the output. The second baseline estimates the optical flow between the last two inputs, and warps the last input using the estimated flow. Further details are given in the supplementary material. Comparison with tracking-based approaches is difficult since (i) segmentation is performed densely and lacks the notion of object instances used by object trackers, and (ii) “stuff” categories (road, vegetation, etc.), useful for drivable area detection in the context of autonomous driving, are not suitable for modeling with tracking-based approaches.
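The two baselines can be illustrated with a few lines of NumPy. The details of the actual baselines are in the supplementary material, so everything below is an assumption made for illustration: the flow field is taken as given (no flow estimation is shown), it is assumed to store the per-pixel displacement `(dx, dy)` expected over the next step at each output location, and a nearest-neighbour backward lookup with border clipping is used.

```python
import numpy as np

def copy_last_input(inputs):
    """Baseline 1: the prediction is simply the last observed frame."""
    return inputs[-1].copy()

def warp_last_input(last, flow):
    """Baseline 2 (simplified): warp the last label map with a flow field.

    last: (H, W) array of labels.
    flow: (H, W, 2) array of (dx, dy) displacements; the value at output
          pixel (y, x) is fetched from (y - dy, x - dx) in `last`.
    """
    h, w = last.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, h - 1)
    return last[src_y, src_x]

# A single "object" pixel (label 1) moving one pixel to the right.
last = np.array([[0, 1, 0, 0]])
flow = np.zeros((1, 4, 2))
flow[..., 0] = 1.0
print(copy_last_input([last]).tolist())
print(warp_last_input(last, flow).tolist())
```

The copy baseline leaves the object where it was, while the warped map shifts it along the assumed motion.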
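Table 1. Short-term (single-frame) prediction results. IoU values are computed with respect to the ground truth (GT); values with respect to the Dilation10 segmentations (SEG) are given in parentheses.
|Model||PSNR||SSIM||IoU GT (SEG)||IoU-MO GT (SEG)|
|Copy last input||20.6||0.65||49.4 (54.6)||43.4 (48.2)|
|Warp last input||23.7||0.76||59.0 (67.3)||54.4 (63.3)|
|Model X2X||24.0||0.77||23.0 (22.3)||12.8 (11.4)|
|Model S2S||-||-||58.3 (64.9)||53.8 (59.8)|
|Model XS2X||24.2||0.77||22.4 (22.5)||10.8 (10.0)|
|Model XS2S||-||-||58.2 (64.6)||53.7 (59.9)|
|Model XS2XS||24.0||0.76||55.5 (61.1)||50.7 (55.8)|
|Model S2S-adv.||-||-||58.3 (65.0)||53.9 (60.2)|
|Model S2S-dil||-||-||59.4 (66.8)||55.3 (63.0)|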
In Figure 4, we show qualitative results of the predictions for one of the validation sequences. From the quantitative result in Table 1 we make several observations. First, in terms of RGB frame prediction (PSNR and SSIM), the performance is comparable for the three models X2X, XS2X, and XS2XS, and improves over the two baselines. This shows that our models learn non-trivial scene dynamics in the RGB pixel space, and that adding semantic segmentations either at input and/or output does not have a substantial impact on this ability.
Second, in terms of the IoU segmentation metrics, the models that directly predict future segmentations (S2S, XS2S, XS2XS) perform much better than the models that only predict the RGB frames. This suggests that artifacts in the RGB frame predictions degrade the performance of the Dilation10 network. See also the corresponding RGB frame predictions in Figure 4.
Third, the XS2XS model, which predicts both segmentations and RGB frames performs somewhat worse than the models that only predict segmentations (S2S and XS2S), suggesting that some of the modeling capacity is compromised by jointly predicting the RGB frames.
Fourth, we find that fine-tuning the S2S model using adversarial training (S2S-adv) does not lead to a significant improvement over normal training.
Table 2 presents results of an ablation study of the S2S model, assessing the impact of the different loss functions, as well as the impact of using one or two scales. We include the results obtained using the Dilation10 model as an “oracle” that predicts the future segmentation based on the future RGB frame, which is not accessible to our other models. This oracle result gives the maximum performance that could be expected: since the oracle was used to provide the training data, we can only expect our models to have at best comparable performance. All variants of the S2S model were trained for about 960,000 iterations, taking about four days of training on a single GPU. The results show that using two scales improves the performance, as does the addition of the gradient difference loss. Training with the $\ell_1$ and/or gdl loss on the softmax pre-activations gives better results than training with the multi-class cross-entropy (MCE) loss on the segmentation labels. This is in line with observations made in network distillation [3, 14].
Finally, we perform further architecture exploration for the S2S model, which performed best. We propose a simpler, deeper, and more efficient architecture with dilated convolutions, to expand the field of view while retaining accurate localization for the predictions. We call this model S2S-dil, and provide details in the supplementary material. This model gives best overall results, reported in Table 1.
Table 2. Ablation study of the S2S model.
|Model||IoU GT||IoU SEG||IoU-MO GT|
|S2S, 2 scales, $\ell_1$+gdl||58.3||64.9||53.8|
|S2S, 1 scale, $\ell_1$+gdl||57.7||63.9||52.6|
|S2S, 2 scales, $\ell_1$||57.6||64.0||53.2|
|S2S, 2 scales, MCE||55.5||60.9||49.7|
We now address the more challenging task of predicting the mid-term future, i.e. the next 0.5 second. In these experiments we take frames 2, 5, 8, and 11 as input, and predict outputs for frames 14, 17 and 20. We compare different strategies: batch models, autoregressive models (AR), and models with autoregressive fine-tuning (AR fine-tune). We compare these strategies to our two baselines: copying the last input, and warping the last input using optical flow. For the optical flow baseline, after the first prediction, we also warp the flow field so that the flow is applied to the correct locations at the next time-step, and so on. Qualitative prediction results are shown in Figure 5. For models XS2X and XS2S, the autoregressive mode is not used because either the frame or the segmentation input is missing for predictions from the second output onward.
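The correspondence between frame indices and prediction horizons is simple arithmetic. The sketch below assumes a rate of 17 frames per second, which is the figure used for the long-term experiment (17 frames correspond to one second); the helper name `seconds_ahead` is hypothetical.

```python
FPS = 17  # assumed frame rate: 17 frames correspond to one second

def seconds_ahead(last_input_frame, target_frame, fps=FPS):
    """Time between the last observed frame and a predicted frame, in seconds."""
    return (target_frame - last_input_frame) / fps

last_input = 11
for target in (14, 17, 20):
    print(target, round(seconds_ahead(last_input, target), 2))
# Frame 14 is about 0.18 s ahead; frame 20 is about 0.53 s ahead.
```

With a frame interval of 3, one prediction step covers about 0.18 s, and three steps reach roughly half a second, matching the short-term and mid-term settings.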
Table 3. Mid-term RGB frame prediction results.
|Model||Frame 14 PSNR||Frame 14 SSIM||Frame 20 PSNR||Frame 20 SSIM|
|Copy last input||20.4||0.64||18.0||0.55|
|Warp last input||23.5||0.76||19.4||0.59|
Table 4. Mid-term segmentation prediction results.
|Model||IoU GT||IoU SEG||IoU-MO GT|
|Copy last input||36.9||39.2||26.8|
|Warp last input||44.3||47.2||37.0|
|S2S, AR, fine-tune||46.7||49.7||39.3|
|S2S-dil, AR, fine-tune||47.8||50.4||40.8|
The results for RGB frame prediction in Table 3 show that for frame 14, all models give comparable results, consistently improve over the copy baseline and perform somewhat better than the warping baseline. For frame 20, the batch models perform best. In contrast, when predicting segmentations, we find that the autoregressive models perform better than the batch models, as reported in Table 4. This is probably because the single-step predictions are more accurate for segmentation, which makes them more suitable for autoregressive modeling. For RGB frame prediction, errors accumulate quickly, leading to degraded autoregressive predictions. Among the batch models, using the images as input (XS2S model) helps slightly. Predicting both the images and segmentations (XS2XS model) performs worst, the image prediction task presumably taking up resources otherwise available for modeling the dynamics of the sequence.
Model S2S is the most effective, as it can be applied in autoregressive mode, and outperforms XS2XS in this setting. In Figure 5 we compare different versions of this model. Visually, the first sequence shows some improvements using the autoregressive fine-tuned model, by more accurately matching contours of the moving cars than the other strategies. The second sequence displays typical failures of the optical flow baseline, where certain values cannot be estimated because they correspond to points that were not present in the input, e.g. those at the back of the incoming car, and must be filled using a standard region filling algorithm. This sequence also displays some improvements of the adversarial fine-tuning on the car contours. More examples are present in the supplementary material, where we can observe that difficult cases for our method include dealing with occlusions and with fast ego-motion.
To evaluate the limits of our S2S autoregressive models on arbitrarily long sequences, we use them to make predictions of up to ten seconds into the future. To this end, we evaluate our models on ten sequences of 238 frames extracted from the long Frankfurt sequence of the Cityscapes validation set. Given four segmentation frames with a frame interval of 17 images, corresponding to exactly one second, we apply our models to predict the next ten. In Figure 7 we report the IoU SEG performance as a function of time. In this extremely challenging setting, the predictive performance quickly drops over time. Fine-tuning the model in autoregressive mode improves its performance, but only gives a clear advantage over the input-copy baseline for predictions at one and two seconds ahead. We also applied our model with a frame interval of 3 to predict up to 55 steps ahead, but found this to perform much worse. Figure 6 shows an example of predictions compared to the actual future segmentations. The visualization shows that our model averages the different classes into an average future, which is perhaps not entirely surprising. Sampling different possible futures using a GAN or VAE approach might be a way to resolve this issue.
To evaluate the generalization capacity of our approach, we test our S2S model on the CamVid dataset, specifically on the test set of 233 images with an 11-class grouping. Ground truth segmentations are provided for every second of the 30 fps video sequences. We first generate the Dilation10 segmentations, without fine-tuning the oracle to the CamVid dataset, using a frame interval of 5, roughly corresponding to a frame interval of 3 on Cityscapes. We note that the class correspondence between Cityscapes and CamVid is not perfect; for instance we associate the class “tree” to “vegetation”. As reported in Table 5, our models have very good mid-term performance on this dataset, considering the oracle results. For reference, an IoU of 65.3 has been reported using a fine-tuned Dilation8.
|Dilation10||Copy last||Warp last||S2S|
We introduced a new visual understanding task of predicting future semantic segmentations. For prediction beyond a single frame, we considered batch models that predict all future frames at once, and autoregressive models that sequentially predict the future frames. While batch models were more effective in the RGB intensities space because of otherwise large error propagation, the more desirable autoregressive mode was more accurate in the semantic segmentation space, supporting with experimental evidence our motivation for this new task. The autoregressive mode lends itself naturally to predicting sequences of arbitrary length, thanks to which we can aim to model more interesting distributions.
In this respect, there is still room for improvement. Where the Dilation10 network for semantic image segmentation gives around 69 IoU, this drops to about 59 when predicting 0.18s ahead and to about 48 for 0.5s. Most predicted object trajectories are reasonable, but do not always correspond to the actual observed trajectories. GAN or VAE models may be useful to address the inherent uncertainty in the prediction of future segmentations. We open-source our Torch-based implementation, and invite the reader to watch videos of our predictions at https://thoth.inrialpes.fr/people/pluc/iccv2017.
Acknowledgment. This work has been partially supported by the grant ANR-16-CE23-0006 “Deep in France” and LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01). We thank Michael Mathieu, Matthijs Douze, Hervé Jegou, Larry Zitnick, Moustapha Cisse, Gabriel Synnaeve and anonymous reviewers for their precious comments.