1 Introduction
The performance of autonomous systems, such as mobile robots or self-driving cars, is heavily influenced by their ability to generate a robust representation of the current environment. Errors in the environment representation are propagated to subsequent processing steps and are hard to recover from. For example, a common error is a missed detection of an object, which might lead to a fatal crash. In order to increase the reliability and safety of autonomous systems, robust methods for observing and interpreting the environment are required.
Deep learning based methods have greatly advanced the state of the art of perception systems. Especially vision-based perception benchmarks (e.g. Cityscapes [Cordts et al. (2016)] or Caltech [Dollár et al. (2009)]) are dominated by approaches utilizing deep neural networks. From a safety perspective, a major disadvantage of such benchmarks is that they are recorded at daytime under idealized environment conditions. To deploy autonomous systems in an open-world scenario without any human supervision, one not only has to guarantee their reliability in good conditions, but also has to make sure that they still work in challenging situations (e.g. sensor outages or heavy weather). One source of such challenges are perturbations inherent in the data, which significantly reduce the information provided to the perception system. In accordance with the classification of uncertainties [Kiureghian and Ditlevsen (2009), Kendall and Gal (2017)], we denote failures originating from data-inherent perturbations as aleatoric failures. These failures cannot be resolved using a more powerful model or additional training data. To solve aleatoric failures, one has to enhance the information provided to the perception system. This can be achieved by fusing the information of multiple sensors, utilizing context information, or considering temporal information. A second class of failures are epistemic failures, which are model or dataset dependent. They can be mitigated by using more training data and/or a more powerful model [Kiureghian and Ditlevsen (2009)].
In this work, we focus on tackling aleatoric failures of a single-frame semantic segmentation model using temporal consistency. Temporal integration is achieved by recurrently filtering a representation of the model with a functionally modularized filter (Fig. 1). In contrast to other available approaches, our filter consists of multiple submodules, decomposing the filtering task into less complex and more transparent subtasks. The basic structure of the filter is inspired by a Bayes estimator, consisting of a prediction step and an update step.
We model the prediction of the representation as an explicit geometric projection, given estimates of the scene geometry and the scene dynamics. The scene geometry and dynamics are represented as a per-pixel depth and a 6-DoF camera motion. Both parameters are estimated within the filter using two task-specific subnetworks.
The decomposition of the prediction task into a model-based transformation as well as a depth and a motion estimation introduces several advantages. Instead of having to learn the dynamics of a high-dimensional representation, we can now model motion separately in a low-dimensional space. The overall filter can therefore be subdivided into two subfilters: a motion filter, which predicts and integrates low-dimensional camera motion, and a feature filter, which handles the integration and prediction of abstract scene features.
An advantage of our approach is its improved transparency, interpretability, and explicitness. Within the filter, we estimate two human-interpretable representations: a depth map and a camera motion (Fig. 1, blue boxes). These representations can be used to inspect the functionality of the model, to split the filter into pretrainable subnetworks, or to debug and validate network behavior. Besides its modularity, our model is trainable in an end-to-end fashion. In contrast to other methods, the proposed filter also works in cases when the current image is not available. Methods that, for example, rely on optical flow fail in such situations due to their inability to compute a meaningful warping.
2 Related Work
In this section, we give an overview of approaches that use temporal information to make segmentation models more robust against aleatoric failures.
Feature-level temporal filtering. A common approach to temporally stabilizing network predictions are feature-level filters. These filters are applied to one or several feature representations, which are integrated using information from previous time steps. Several works implement such a filter using fully learned, model-free architectures. Fayyaz et al. (2016) and Valipour et al. (2017) generate a feature representation for each image in a sequence and use recurrent neural networks to temporally filter them.
Jin et al. (2016) utilize a sequence of previous images to predict a feature representation of the current image. The predicted representation is fused with the one of the current image and propagated through a decoder network. The Recurrent Fully Convolutional DenseNet of Wagner et al. (2018) utilizes a hierarchical filter concept to increase the robustness of a segmentation model. Being model-free, these filters require many parameters and are therefore harder to train. Due to their low interpretability, it is also quite difficult to include constraints and to inspect or debug their behavior.

A second class of feature-level filters utilizes a partially model-based approach to integrate features. These approaches use an explicit model to implement the temporal propagation of features and learn a subnetwork to fuse the propagated features with features of the current time step. A common model to implement the propagation is optical flow. The replacement field parametrizing the flow can be predicted within the model Vu et al. (2018) or computed using classical methods Gadde et al. (2017); Nilsson and Sminchisescu (2016). These models are well suited to reduce epistemic failures, but often fail to resolve aleatoric failures, due to their dependence on the availability of the current frame. More sophisticated feature propagation models exist Zhou et al. (2017); Mahjourian et al. (2016, 2018); Yin and Shi (2018), which additionally constrain the transformation. Such a model was recently used to temporally aggregate learned features within a multitask model Radwan et al. (2018). Our model is also partially model-based and utilizes a more sophisticated propagation model similar to Radwan et al. (2018). In contrast to all presented model-based approaches, our filter does not depend on the availability of the current frame.
Post-processing based temporal integration. Some approaches use post-processing steps to integrate the predictions of single-frame segmentation models. Lei and Todorovic (2016) propose the Recurrent-Temporal Deep Field model for video segmentation, which combines a convolutional neural network, a recurrent temporal restricted Boltzmann machine, and a conditional random field.
Kundu et al. (2016) propose a long-range spatio-temporal regularization using a conditional random field operating on a feature space, optimized to minimize the distance between features associated with corresponding points. Our temporal integration approach differs from these post-processing methods due to the integration of rich feature representations instead of segmentations. The modular structure of our filter, with its human-interpretable representations, also makes it more transparent.
Spatio-temporal fusion. Other approaches build semantic video segmentation networks using spatio-temporal features. Tran et al. (2016) and Zhang et al. (2014) use 3D convolutions to compute such features. The Recurrent Convolutional Neural Network of Pavel et al. (2017) is another spatio-temporal architecture. This method uses layer-wise recurrent self-connections as well as top-down connections to stabilize representations. These approaches require a large number of parameters, and it is quite difficult to integrate physical constraints into them.
3 Functionally Modularized Temporal Filtering
The aim of this work is to improve the robustness of a deep neural network that receives a measurement $\mathbf{x}_t$ and produces a pixel-wise semantic segmentation $\mathbf{s}_t$. We assume the model consists of two parts: a feature encoder $f_{\text{enc}}$ and a semantic decoder $f_{\text{dec}}$. The feature encoder generates an abstract feature representation $\mathbf{z}_t$ of the image $\mathbf{x}_t$. This representation is upsampled and refined by the semantic decoder to produce a dense segmentation $\mathbf{s}_t$:

$\mathbf{z}_t = f_{\text{enc}}(\mathbf{x}_t), \qquad \mathbf{s}_t = f_{\text{dec}}(\mathbf{z}_t)$.  (1)

Due to data-inherent perturbations, the representation $\mathbf{z}_t$ is only an approximation of the true feature representation without perturbations. Using a temporal filter $f_{\text{filt}}$, we try to improve the estimate $\tilde{\mathbf{z}}_t$ of the features and, as a result, the estimate $\tilde{\mathbf{s}}_t$ of the semantic decoder:

$(\tilde{\mathbf{z}}_t, \mathbf{h}_t) = f_{\text{filt}}(\mathbf{z}_t, \mathbf{h}_{t-1}), \qquad \tilde{\mathbf{s}}_t = f_{\text{dec}}(\tilde{\mathbf{z}}_t)$.  (2)
All prior knowledge about scene features and dynamics, aggregated from previous time steps, is encoded in the hidden state $\mathbf{h}_{t-1}$.
Framing Eq. 2 in the context of a Bayesian estimator, the recurrent filter module has to propagate the belief about the hidden state one time step into the future, update the belief using the current filter input $\mathbf{z}_t$, and compute an improved estimate of the true feature representation. To make our filter module more transparent, we adopt the basic structure of a Bayesian estimator and split the filter into a prediction module $f_{\text{pred}}$ and an update module $f_{\text{up}}$:

$\bar{\mathbf{h}}_t = f_{\text{pred}}(\mathbf{h}_{t-1}), \qquad (\tilde{\mathbf{z}}_t, \mathbf{h}_t) = f_{\text{up}}(\bar{\mathbf{h}}_t, \mathbf{z}_t)$.  (3)
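The prediction/update structure of Eq. 3 can be sketched as a small recurrent loop. The sketch below uses hypothetical linear modules purely for illustration; the actual prediction and update modules of the paper are the geometric projection and the learned fusion described in the following sections.

```python
import numpy as np

def filter_step(z_t, h_prev, predict, update):
    """One step of the prediction/update filter (Eq. 3).

    predict: propagates the hidden state one step into the future.
    update:  fuses the propagated state with the new encoder features and
             returns the refined features plus the new hidden state.
    """
    h_pred = predict(h_prev)            # prediction module
    z_hat, h_t = update(h_pred, z_t)    # update module
    return z_hat, h_t

# Toy instantiation (hypothetical linear modules, for illustration only):
predict = lambda h: 0.9 * h                            # decay-style propagation
update = lambda h, z: ((h + z) / 2.0, (h + z) / 2.0)   # simple averaging fusion

h = np.zeros(4)
for z in [np.ones(4), 2.0 * np.ones(4)]:
    z_hat, h = filter_step(z, h, predict, update)
```

Any concrete filter of this form only needs to specify the two modules; the recurrence itself stays fixed.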
The prediction module propagates the hidden state, while the update module refines it using the information in $\mathbf{z}_t$ to derive an improved estimate of the encoder representation. The prediction module therefore has to learn the complex dynamics of a high-dimensional hidden state. To increase explainability and divide the prediction task into easier subtasks, we split the hidden state into a high-dimensional static state $\mathbf{h}^{s}$ encoding all scene features and a low-dimensional dynamic state $\mathbf{h}^{d}$ encoding scene dynamics (Fig. 1).
The prediction of $\mathbf{h}^{s}$ can now be performed fully model-based using a geometric projection. This is possible since the prediction only has to account for spatial feature displacements. To compute a valid projection, estimates of the scene geometry and the scene dynamics are required. We encode the scene geometry as a per-pixel depth and derive it from the static hidden state by means of a depth decoder. A 3D rigid transformation is used to characterize the scene dynamics, assuming the dynamics are dominated by camera motion. The predicted static hidden state is updated in a second module using the new information of the input $\mathbf{z}_t$. These two modules form the static feature filter.
Scene dynamics, represented by the low-dimensional state $\mathbf{h}^{d}$, are filtered in a second subfilter. A motion estimation module is used to project from the high-dimensional scene feature space into a low-dimensional motion feature space. This transformation is fully learned, enabling the model to generate a representation well suited for motion integration.
By decoupling motion and scene features, it is much easier to incorporate auxiliary information such as acceleration data of the sensor. This kind of information can now be fused (see the fusion module in Fig. 1) in a much more targeted way with the appropriate motion features derived from image pairs (see the motion estimation module in Fig. 1). An additional advantage of the decoupling is the global modelling of camera motion. The motion is guaranteed to be consistent across spatial scene features and can be estimated using correlations across full image pairs.
Another way of looking at our model is that it consists of an underlying multitask model:

$(\mathbf{s}_t, d_t, T_t) = f_{\text{multi}}(\mathbf{x}_{t-1}, \mathbf{x}_t)$,  (4)

predicting a segmentation $\mathbf{s}_t$, a depth map $d_t$, and a 3D rigid transformation $T_t$. The encoder representation is integrated over time using an additional filter module, which utilizes decoder outputs to propagate previous knowledge. As a result, the decoders either operate on a filtered encoder representation or are filtered separately (see the motion filter), making the functionality of our model independent of new inputs $\mathbf{x}_t$. This property sets our filter apart from other approaches.
The overall filter is set up to increase transparency and interpretability by modularizing functionalities, using model-based computations, and introducing human-interpretable representations. Compared to other architectures, it is hence much easier to debug and validate the model, inspect intermediate results, and pretrain subnetworks. These properties are also particularly relevant with regard to safety analysis. From a multitask perspective, the two auxiliary tasks may also benefit segmentation due to the implicit regularization Ruder (2017).
3.1 Feature Filter
Depth Estimation. We compute a per-pixel depth using a decoder network operating on the filtered representation. The depth decoder consists of three convolutional layers with kernel sizes 3×3, 1×1, and 1×1, respectively. We apply batch normalization and use ReLU nonlinearities in each layer. The predicted depth is therefore always positive and valid. The first two layers use a fixed number of features, and the last layer predicts one value per pixel. Instead of directly predicting depth values, we let the decoder provide the inverse depth $\xi = 1/d$, which puts less focus on wrong predictions at larger distances. For supervision during training, we use two losses. An L1 loss on the inverse depth:

$\mathcal{L}_{\text{depth}} = \sum_{i,j} \left| \hat{\xi}(i,j) - \xi(i,j) \right|$,  (5)
and a scale-invariant gradient loss Ummenhofer et al. (2017) to take dependencies between neighboring depth values into account:
$\mathcal{L}_{\text{grad}} = \sum_{h} \sum_{i,j} \left\| g_h[\hat{\xi}](i,j) - g_h[\xi](i,j) \right\|_2$,  (6)

$g_h[f](i,j) = \left( \dfrac{f(i+h,j) - f(i,j)}{|f(i+h,j)| + |f(i,j)|},\; \dfrac{f(i,j+h) - f(i,j)}{|f(i,j+h)| + |f(i,j)|} \right)^{\top}$.  (7)
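The two depth losses (Eqs. 5-7) can be sketched in numpy as follows. The spacing set and the small stabilizing epsilons are illustrative assumptions, not values from the paper; the gradient normalization follows the scale-invariant form of Ummenhofer et al. (2017).

```python
import numpy as np

def l1_inv_depth_loss(pred_inv, gt_inv):
    # L1 loss on the inverse depth (cf. Eq. 5)
    return np.abs(pred_inv - gt_inv).sum()

def si_gradient_loss(pred_inv, gt_inv, spacings=(1, 2, 4)):
    """Scale-invariant gradient loss (cf. Eqs. 6-7).

    Discrete gradients at several spacings h, normalized by the local
    magnitudes so the loss is insensitive to the absolute depth scale.
    """
    def g(f, h):
        gx = (f[:, h:] - f[:, :-h]) / (np.abs(f[:, h:]) + np.abs(f[:, :-h]) + 1e-8)
        gy = (f[h:, :] - f[:-h, :]) / (np.abs(f[h:, :]) + np.abs(f[:-h, :]) + 1e-8)
        return gx, gy

    loss = 0.0
    for h in spacings:
        px, py = g(pred_inv, h)
        tx, ty = g(gt_inv, h)
        # crop both gradient maps to their common region
        dx = px[:-h, :] - tx[:-h, :]
        dy = py[:, :-h] - ty[:, :-h]
        loss += np.sqrt(dx ** 2 + dy ** 2 + 1e-12).mean()
    return loss
```

Because the gradients are normalized by the local magnitudes, multiplying the prediction by a global constant leaves the gradient loss essentially unchanged, which is exactly the scale invariance the loss is designed for.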
Prediction / Geometric Projection. To make the prediction of features more explicit, we use a geometric projection Zhou et al. (2017); Mahjourian et al. (2016). Let $\mathbf{p}_{t-1}$ be the homogeneous coordinates of a pixel at time step $t-1$, $d_{t-1}$ the corresponding depth, $T_t$ the rigid transformation between the two camera poses, and $K$ the camera intrinsic matrix. The projection can be implemented as:

$\mathbf{p}_t = K \, T_t \, d_{t-1}(\mathbf{p}_{t-1}) \, K^{-1} \mathbf{p}_{t-1}$.  (8)
To keep the notation short, we omitted all conversions related to homogeneous coordinates. The projected coordinates $\mathbf{p}_t$ are continuous and have to be discretized. Additionally, it is necessary to account for ambiguities in cases where multiple pixels at time step $t-1$ are assigned to the same pixel at time step $t$. We resolve these ambiguities by keeping the transformed pixel with the smaller depth (the object closer to the sensor). This projection is differentiable with respect to the scene features. In contrast to other methods, our implementation does not depend on information of time step $t$. This is an important property for resolving aleatoric failures.
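A non-differentiable numpy sketch of this projection, under the stated assumptions (nearest-neighbour discretization and a z-buffer to resolve ambiguities), might look as follows. The paper's actual implementation is differentiable with respect to the scene features; this sketch only illustrates the geometry.

```python
import numpy as np

def project_features(feat, depth, K, R, t):
    """Warp a feature map from t-1 to t via a rigid transform (cf. Eq. 8).

    feat:  (H, W, C) feature map at time t-1
    depth: (H, W) per-pixel depth at time t-1
    K:     (3, 3) camera intrinsics; R, t: camera rotation / translation.
    Ambiguities are resolved with a z-buffer (closest surface wins).
    """
    H, W, C = feat.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, N)
    cam = np.linalg.inv(K) @ pix * depth.reshape(-1)   # back-project to 3D
    cam2 = R @ cam + t[:, None]                        # rigid transform
    z = cam2[2]
    proj = K @ cam2                                    # re-project to pixels
    un = np.round(proj[0] / z).astype(int)             # nearest-neighbour
    vn = np.round(proj[1] / z).astype(int)
    out = np.zeros_like(feat)
    zbuf = np.full((H, W), np.inf)
    valid = (un >= 0) & (un < W) & (vn >= 0) & (vn < H) & (z > 0)
    src = feat.reshape(-1, C)
    for i in np.flatnonzero(valid):
        if z[i] < zbuf[vn[i], un[i]]:                  # keep the closer surface
            zbuf[vn[i], un[i]] = z[i]
            out[vn[i], un[i]] = src[i]
    return out
```

With the identity transform and a constant depth, the warp reduces to the identity mapping, which is a convenient sanity check.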
Update / Feature Fusion. The update module enables the network to weight the predicted representation $\bar{\mathbf{h}}^{s}_t$ and the input representation $\mathbf{z}_t$ depending on the information content of these two representations (data-dependent weighting). For each pixel position, a weighting value $w \in [0, 1]$ is estimated that indicates whether one can rely on prior knowledge ($w \to 0$) or on information of new inputs ($w \to 1$). This weight matrix is calculated similarly to the gates of a convolutional LSTM, but contains only one value per pixel instead of one value per pixel and feature:

$\mathbf{w}_t = \sigma\left( W_z * \mathbf{z}_t + W_h * \bar{\mathbf{h}}^{s}_t + b \right)$.  (9)
The convolutional operator is indicated by $*$, $W_z$ and $W_h$ are 3×3 kernels, and $b$ is a bias. Using $\mathbf{w}_t$ and element-wise multiplications $\odot$, the update module computes:

$\mathbf{h}^{s}_t = \mathbf{w}_t \odot \mathbf{z}_t + (1 - \mathbf{w}_t) \odot \bar{\mathbf{h}}^{s}_t$.  (10)
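The gated fusion of Eqs. 9-10 can be sketched in numpy as below. As a simplification, the gate here is computed from the channel means of the two representations rather than from the full feature stacks, so the kernels and the exact gate input are assumptions of the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d(x, k):
    # 'same' 3x3 convolution over a single-channel map (zero padding)
    H, W = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * xp[i:i + H, j:j + W]
    return out

def fuse(z_t, h_pred, k_z, k_h, b):
    """Data-dependent feature fusion (cf. Eqs. 9-10).

    One gate value per pixel weights the new input (w -> 1)
    against propagated prior knowledge (w -> 0).
    """
    w = sigmoid(conv2d(z_t.mean(-1), k_z) + conv2d(h_pred.mean(-1), k_h) + b)
    w = w[..., None]                        # broadcast the gate over features
    return w * z_t + (1.0 - w) * h_pred
```

Saturating the gate reproduces the two extreme behaviors: a gate near one passes the new input through, a gate near zero keeps the propagated prediction.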
3.2 Motion Filter
The motion filter consists of two components: a motion estimator and a motion integration module. Both modules are model-free and learned during training (see Section 4.1).
Motion Estimation. Using a decoder network, the motion estimation module learns a projection from the high-dimensional scene feature space to the low-dimensional motion feature space. To stabilize motion estimates, the projected features are combined in a fusion module with acceleration data of the camera. The motion estimation module is depicted in Fig. 2. Pairs of encoder representations, concatenated along the feature dimension, are used as the input of the motion decoder. We apply batch normalization and utilize ReLU nonlinearities in each convolutional and fully connected layer.
Temporal Motion Integration. If the input is noisy, the motion features $\mathbf{m}_t$ computed in the motion estimator contain only limited information. In order to still obtain a meaningful motion estimate, we integrate motion features over time in a model-free filter. This filter is based on a gated recurrent unit (GRU) Cho et al. (2014) and defined by:

$\mathbf{r}_t = \sigma\left( W_r \left[ \mathbf{m}_t, \mathbf{h}^{d}_{t-1} \right] + \mathbf{b}_r \right)$,  (11)

$\mathbf{u}_t = \sigma\left( W_u \left[ \mathbf{m}_t, \mathbf{h}^{d}_{t-1} \right] + \mathbf{b}_u \right)$,  (12)

$\mathbf{h}^{d}_t = (1 - \mathbf{u}_t) \odot \mathbf{h}^{d}_{t-1} + \mathbf{u}_t \odot \tanh\left( W_c \left[ \mathbf{m}_t, \mathbf{r}_t \odot \mathbf{h}^{d}_{t-1} \right] + \mathbf{b}_c \right)$.  (13)
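A minimal numpy sketch of one such GRU step (the standard formulation of Cho et al. (2014); layer sizes and parameter names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(m_t, h_prev, P):
    """One GRU update (cf. Eqs. 11-13) integrating motion features m_t
    into the low-dimensional dynamic hidden state h."""
    x = np.concatenate([m_t, h_prev])
    r = sigmoid(P['Wr'] @ x + P['br'])   # reset gate
    u = sigmoid(P['Wu'] @ x + P['bu'])   # update gate
    c = np.tanh(P['Wc'] @ np.concatenate([m_t, r * h_prev]) + P['bc'])
    return (1.0 - u) * h_prev + u * c    # convex combination of old and new
```

Since the candidate state is bounded by the tanh and the update gate forms a convex combination, a hidden state initialized at zero stays inside the unit cube, which keeps the filtered motion representation well-behaved over long sequences.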
To infer the 3D rigid camera transformation from the filtered hidden state $\mathbf{h}^{d}_t$, we propagate it through two additional fully connected layers. The output layer predicts the translation vector and the sines of the rotation angles. The first layer applies batch normalization and uses a ReLU nonlinearity; the output layer uses no nonlinearity for the translation vector and clips the sine estimates to $[-1, 1]$. Based on the clipped values, we compute the rotation matrix $R$.

Motion Supervision. All parameters of the motion filter are trained using ground-truth camera translation vectors and rotation matrices. The losses are based on the relative transformation between the predicted and ground-truth motion, as defined by Vijayanarasimhan et al. (2017):
$\mathcal{L}_{\text{trans}} = \left\| \mathbf{t}_{\text{rel}} \right\|_2$,  (14)

$\mathcal{L}_{\text{rot}} = \arccos\left( \dfrac{\operatorname{tr}(R_{\text{rel}}) - 1}{2} \right)$,  (15)

where $(R_{\text{rel}}, \mathbf{t}_{\text{rel}})$ denotes the relative transformation between the predicted and ground-truth motion.
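These two losses amount to the translation norm and the rotation angle of the relative transform, which can be sketched directly in numpy (the exact composition of the relative transform is an assumption of the sketch):

```python
import numpy as np

def motion_losses(R_pred, t_pred, R_gt, t_gt):
    """Translation and rotation losses (cf. Eqs. 14-15), defined on the
    relative transform between predicted and ground-truth motion."""
    R_rel = R_gt.T @ R_pred
    t_rel = R_gt.T @ (t_pred - t_gt)
    loss_t = np.linalg.norm(t_rel)
    # rotation angle of R_rel via its trace, clipped for numerical safety
    cos = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    loss_r = np.arccos(cos)
    return loss_t, loss_r
```

Both losses vanish exactly when prediction and ground truth coincide, and a pure 90-degree rotation error yields a rotation loss of pi/2.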
4 Experiments
4.1 Implementation Details
Dataset. We evaluate our filter using the SceneNet RGBD dataset McCormac et al. (2017), which consists of 5M photorealistically rendered RGBD images recorded along 15K indoor trajectories. Besides the camera motion, all scenes are assumed to be static. Due to its simulated nature, the dataset provides labels for semantic segmentation, depth estimation, and camera motion estimation. We split the training data into a training and a validation set and use the provided validation data to set up the test set. For training, we use all non-overlapping sequences of length 7 generated from the training trajectories. The test set is constructed by sampling 5 non-overlapping sequences of length 7 from each test trajectory, resulting in 5,000 test sequences.
To add aleatoric uncertainty, all sequences are additionally perturbed with noise, clutter, and changes in lighting conditions. Noise is simulated by adding zero-mean Gaussian noise to each pixel. Clutter is introduced by setting subregions of each image to the pixel mean, computed on a per-sequence basis. The clutter is generated once per sequence and applied to each frame. Thus, the resulting clutter pattern is the same in each frame, comparable to dirt on the camera lens. To simulate rapid changes in lighting conditions, we increase or decrease the intensity of frames by a random value and let this offset decay over time. Such a perturbation occurs, for example, when the light is suddenly switched off in a room. We include a more detailed description of the perturbations in the supplementary material.
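The three perturbations can be sketched as follows. All parameter values (noise level, onset frame, scale, and decay factor of the lighting change) are illustrative placeholders, not the ranges used in the paper.

```python
import numpy as np

def perturb_sequence(seq, rng, noise_std=0.05, light_frame=3,
                     light_scale=0.4, decay=0.6, clutter_mask=None):
    """Apply noise, static clutter, and a decaying lighting change to a
    sequence (T, H, W, C) of images in [0, 1].  Illustrative parameters."""
    mean = seq.mean(axis=(0, 1, 2))                   # per-sequence pixel mean
    out = []
    for t, img in enumerate(seq):
        x = img + rng.normal(0.0, noise_std, img.shape)       # sensor noise
        if clutter_mask is not None:                          # static clutter
            m = clutter_mask[..., None]
            x = (1.0 - m) * x + m * mean
        if t >= light_frame:                                  # decaying light change
            x = x * (1.0 + light_scale * decay ** (t - light_frame))
        out.append(np.clip(x, 0.0, 1.0))                      # valid intensities
    return np.stack(out)
```

Because the clutter mask is sampled once per sequence and reused for every frame, the clutter stays fixed across time, as described above.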
Unfiltered Baseline. We use the Pyramid Scene Parsing Network (PSPNet) Zhao et al. (2017) as the basis for all architectures (Fig. 3, highlighted in green). The used PSPNet is comparatively small to keep the computational effort and the required memory of the filtered models manageable.
To train our filter module, we additionally need ground-truth depth maps and camera motions. In order to make the comparison of the resulting filtered architecture with the unfiltered baseline fairer, we use a multitask version of the PSPNet (MPSPNet) in the evaluation. This model operates on image pairs and additionally predicts the camera motion and per-pixel depth maps. It can thus also take advantage of all the benefits of multitask learning Ruder (2017). The full MPSPNet (see Fig. 3) uses the depth decoder introduced in Section 3.1 as well as the motion decoder introduced in Section 3.2. To predict a valid rigid transformation, we reuse the last fully connected layer of the motion filter.
Filtered models. Building upon MPSPNet, we set up our filtered version using the functionally modularized filter concept introduced in Section 3. We call the resulting filtered model FMTNet.
As an additional temporally filtered baseline, we use a model-free, feature-level filter. Such a filter is well suited to address aleatoric failures, as it does not necessarily require information of the current frame. We use a filter module (denoted by MFF) similar to the one introduced in Wagner et al. (2018) (Fig. 4). This filter module receives the representation of MPSPNet as input and generates an improved estimate (cf. Eq. 2). In the following, we will refer to the MPSPNet with model-free filter MFF as MFFMPSPNet. To be comparable with respect to filter complexity, the number of parameters in the filter MFF matches the number of parameters in our modularized filter. In the case of our filter, we count the parameters of the depth and motion decoders as part of the filter, since these decoders are required for filtering. The use of all three decoders in the MFFMPSPNet guarantees comparable training signals, but is not strictly necessary. Hence, we do not assign the depth and motion decoder weights to the filter MFF, resulting in MFFMPSPNet having 1.4 times the parameters of FMTNet.
Training Procedure. All models (MPSPNet, FMTNet, and MFFMPSPNet) are trained using the multitask loss introduced by Kendall and Gal (2017), which learns the optimal weighting between the cross-entropy segmentation loss, the two depth losses, and the two motion losses. We train using Adam Kingma and Ba (2014) with a weight decay of 0.0001 and apply dropout with probability 0.1 in the decoders. All components of FMTNet and MFFMPSPNet that do not belong to the filter are initialized with the corresponding weights of the trained MPSPNet.
Due to its modularity, we can additionally pretrain two components of our filter. First, we pretrain all weights of the motion filter while keeping the encoder weights fixed. Second, we pretrain the weights of the feature update module as well as the encoder, while keeping all decoders fixed. This second training is performed with sequences containing the same image, perturbed with aleatoric noise. Finally, we fine-tune the overall architecture.
4.2 Evaluation
To evaluate the segmentation performance, we use the Mean Intersection over Union (Mean IoU) on test sequences, computed with 13 classes: bed, books, ceiling, chair, floor, furniture, objects, painting, sofa, table, TV, wall, and window.
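Mean IoU is computed from a class confusion matrix; a minimal sketch (the handling of classes absent from both prediction and ground truth is an assumption):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection over Union from a confusion matrix.
    pred, gt: integer label maps of equal shape."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)   # unbuffered accumulation
    inter = np.diag(cm).astype(float)
    union = cm.sum(0) + cm.sum(1) - inter
    ious = inter[union > 0] / union[union > 0]     # ignore absent classes
    return ious.mean()
```

A perfect prediction yields a Mean IoU of 1.0; partially wrong predictions reduce both the intersection and inflate the union of the affected classes.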
In the first two experiments, we evaluate the motion filter and the update module of the feature filter on toy-like data. In the third experiment, we compare our approach with the unfiltered and the filtered baseline using the test dataset described in Section 4.1.
Static Feature Integration. To test the functionality of the feature update module, we use a separate static toy dataset with sequences of length four (see Fig. 5a). Each frame of a sequence contains the same clean image (a random image of the SceneNet RGBD dataset without any of the aleatoric perturbations introduced in Section 4.1), 50% of which is replaced by Gaussian noise. We fine-tune the encoder network and the feature update module of FMTNet on the toy data. Due to the static nature of the sequences, we remove the motion filter and use the identity transformation (static camera) in the feature filter.
As shown in Fig. 5, FMTNet integrates information over time. It has learned a meaningful data-dependent weighting between previous information stored in the hidden filter state and information provided by new frames (see the weights in Fig. 5d). The same behavior can be seen in Tab. 5, which reports the Mean IoU on a per-frame basis, computed using 300,000 test sequences. The performance of our model increases over time due to new information.
Temporal Motion Integration. In order to obtain a meaningful motion estimate for images that do not contain any information, it is essential to propagate and aggregate dynamics over time. Using a dynamic toy dataset, we evaluate the ability of our motion filter to perform these two tasks. The dataset contains sequences of length 10 in which we have replaced the last five frames with Gaussian noise. In Fig. 6 and Tab. 2, we report the performance of our motion filter, which has been fine-tuned on the dynamic toy data. We use the translation norm (Eq. 14) and the rotation angle (Eq. 15) of the relative transformation between predicted and ground-truth motion as evaluation metrics.
(Tab. 2: translation norm and rotation angle of the motion filter for frame pairs 1–2 through 9–10.)
In the first four computation steps, the translation norm and the rotation angle decrease as the filter integrates information. In the next five steps, the filter still delivers meaningful predictions, which slowly get worse due to accumulating errors. In Fig. 6, we show the successive projection of the first frame, computed with ground-truth motions and predicted motions, respectively. We use the ground-truth depth maps for both successive projections.
Comparison with baselines. To compare our model with the introduced baselines, we use the test set described in Section 4.1. In Tab. 3, we report the Mean IoU of all models on a per-frame basis. The results show a clear superiority of the filtered models (MFFMPSPNet, FMTNet) compared to the unfiltered baseline (MPSPNet). Only for the first frame does MPSPNet outperform the filtered architectures. This is most likely due to not yet well-initialized hidden filter states. Our model surpasses the other filtered baseline and does not seem to be as strongly affected by poorly initialized hidden states. Unexpectedly, the performance of our model decreases again from Frame 5 onward. We suspect that this is due to the fairly simple design of our feature update module; a more sophisticated fusion approach could counter this behavior. We plan to further investigate this deficiency in the future. An example prediction of FMTNet is included in the supplementary material.
(Tab. 3: per-frame Mean IoU (Frames 1–7) for MPSPNet, MFFMPSPNet, and FMTNet (ours).)
5 Conclusion
In this paper, we have introduced a functionally modularized temporal representation filter to tackle aleatoric failures of a single-frame segmentation model. The main idea behind the filter is to decompose the filtering task into less complex and more transparent subtasks. The resulting filter consists of multiple submodules, which can be pretrained, debugged, and evaluated independently. In contrast to many other approaches in the literature, our filter also works in challenging situations, e.g. during brief sensor outages. Using a simulated dataset, we showed the superiority of our model compared to classical baselines. In the future, we plan to extend our filter to explicitly model dynamic objects in the scene.
References
 Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoderdecoder for statistical machine translation. Preprint arXiv:1406.1078, 2014.

Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 Dollár et al. (2009) P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 Fayyaz et al. (2016) Mohsen Fayyaz, Mohammad Hajizadeh Saffar, Mohammad Sabokrou, Mahmood Fathy, Reinhard Klette, and Fay Huang. STFCN: Spatiotemporal FCN for semantic video segmentation. Preprint arXiv:1608.05971, 2016.
 Gadde et al. (2017) Raghudeep Gadde, Varun Jampani, and Peter V. Gehler. Semantic video CNNs through representation warping. In IEEE International Conference on Computer Vision (ICCV), 2017.
 Jin et al. (2016) Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, et al. Video scene parsing with predictive feature learning. Preprint arXiv:1612.00119, 2016.
 Kendall and Gal (2017) Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? Preprint arXiv:1703.04977, 2017.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Preprint arXiv:1412.6980, 2014.
 Kiureghian and Ditlevsen (2009) Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105 – 112, 2009. Risk Acceptance and Risk Communication.
 Kundu et al. (2016) Abhijit Kundu, Vibhav Vineet, and Vladlen Koltun. Feature space optimization for semantic video segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3168–3175, 2016.
 Lei and Todorovic (2016) Peng Lei and Sinisa Todorovic. Recurrent temporal deep field for semantic video labeling. In European Conference on Computer Vision (ECCV), pages 302–317, 2016.
 Mahjourian et al. (2016) Reza Mahjourian, Martin Wicke, and Anelia Angelova. Geometrybased next frame prediction from monocular video. Preprint arXiv:1609.06377, 2016.
 Mahjourian et al. (2018) Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and egomotion from monocular video using 3D geometric constraints. Preprint arXiv:1802.05522, 2018.

McCormac et al. (2017) John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J. Davison. SceneNet RGBD: Can 5M synthetic images beat generic ImageNet pretraining on indoor segmentation. In IEEE International Conference on Computer Vision (ICCV), 2017.
 Nilsson and Sminchisescu (2016) David Nilsson and Cristian Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. CoRR, abs/1612.08871, 2016.
 Pavel et al. (2017) Mircea Serban Pavel, Hannes Schulz, and Sven Behnke. Object class segmentation of RGBD video using recurrent convolutional neural networks. Neural Networks, Elsevier, 2017.
 Radwan et al. (2018) Noha Radwan, Abhinav Valada, and Wolfram Burgard. VLocNet++: Deep multitask learning for semantic visual localization and odometry. Preprint arXiv:1804.08366, 2018.
 Ruder (2017) Sebastian Ruder. An overview of multitask learning in deep neural networks. Preprint arXiv:1706.05098, 2017.
 Tran et al. (2016) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Deep end2end voxel2voxel prediction. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 17–24, 2016.
 Ummenhofer et al. (2017) Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 Valipour et al. (2017) Sepehr Valipour, Mennatullah Siam, Martin Jagersand, and Nilanjan Ray. Recurrent fully convolutional networks for video segmentation. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 29–36, 2017.
 Vijayanarasimhan et al. (2017) Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki. SfMNet: Learning of structure and motion from video. CoRR, abs/1704.07804, 2017.
 Vu et al. (2018) TuanHung Vu, Wongun Choi, Samuel Schulter, and Manmohan Chandraker. Memory warps for learning longterm online video representations. Preprint arXiv:1803.10861, 2018.
 Wagner et al. (2018) Jörg Wagner, Volker Fischer, Michael Herman, and Sven Behnke. Hierarchical recurrent filtering for fully convolutional densenets. In European Symposium on Artificial Neural Networks (ESANN), 2018.
 Yin and Shi (2018) Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. Preprint arXiv:1803.02276, 2018.
 Zhang et al. (2014) Han Zhang, Kai Jiang, Yu Zhang, Qing Li, Changqun Xia, and Xiaowu Chen. Discriminative feature learning for video semantic segmentation. In International Conference on Virtual Reality and Visualization (ICVRV), pages 321–326, 2014.
 Zhao et al. (2017) Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.
 Zhou et al. (2017) Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and egomotion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
1 Simulation of Aleatoric Perturbations
Aleatoric failures originate from perturbations inherent in the data. To simulate such perturbations, we add noise, clutter, and changes in lighting conditions to all sequences. In the following, we give a detailed description of the process used to generate these perturbations. After applying the perturbations to the clean sequences generated from the SceneNet RGBD dataset, we clip pixel values to the valid intensity range to obtain valid images. Example sequences are shown in Fig. 1.
Noise is simulated by adding independent Gaussian noise with zero mean to each pixel. The variance of the noise is independently sampled for each sequence from a fixed interval.

Clutter is introduced by setting subregions of each image to the pixel mean, computed on a per-sequence basis. The clutter is generated once per sequence and applied to each frame. Thus, the resulting clutter pattern is the same in each frame, comparable to dirt on the camera lens. The perturbed images are calculated by:
$\tilde{\mathbf{x}}_t = (1 - M) \odot \mathbf{x}_t + M \odot \bar{\mathbf{x}}$,  (1)
where $M$ is a per-sequence clutter mask, $\bar{\mathbf{x}}$ the per-sequence pixel mean, and $\mathbf{x}_t$ the clean image. The clutter mask is generated by summing Gaussian kernels whose centers are randomly placed (uniformly sampled) within the image dimensions. Each kernel is normalized to the maximum value one. The number of kernels is uniformly sampled for each sequence from a fixed interval. In addition, we uniformly sample the standard deviation of each dimension independently from a fixed interval. The kernels are truncated at three times the standard deviation.

Changes in lighting conditions are simulated by increasing or decreasing the intensity of frames. For each sequence, we uniformly sample one frame $t_0$ and a scaling factor $s$ from a fixed interval. In addition, we draw a multiplier $m$ which, with a probability of 0.5, is either 1 or −1. The perturbed images are calculated by:
$o_t = m \, s \, \gamma^{\,t - t_0} \quad \text{for } t \ge t_0$,  (2)

$\tilde{\mathbf{x}}_t = (1 + o_t) \, \mathbf{x}_t$,  (3)

where $\gamma \in (0, 1)$ is a decay factor.
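The clutter mask construction described above can be sketched as follows. The kernel count and standard deviation ranges are illustrative placeholders, and the final clipping of the summed mask to [0, 1] is an assumption of the sketch (so the mask remains a valid blending weight).

```python
import numpy as np

def clutter_mask(H, W, rng, n_kernels=5, sigma_range=(5.0, 20.0)):
    """Per-sequence clutter mask: a sum of randomly placed Gaussian kernels,
    each normalized to a maximum of one and truncated at 3 sigma."""
    yy, xx = np.mgrid[0:H, 0:W]
    mask = np.zeros((H, W))
    for _ in range(n_kernels):
        cy, cx = rng.uniform(0, H), rng.uniform(0, W)       # random center
        sy = rng.uniform(*sigma_range)                       # per-axis sigma
        sx = rng.uniform(*sigma_range)
        g = np.exp(-((yy - cy) ** 2 / (2 * sy ** 2)
                     + (xx - cx) ** 2 / (2 * sx ** 2)))
        g[(np.abs(yy - cy) > 3 * sy) | (np.abs(xx - cx) > 3 * sx)] = 0.0
        mask += g / g.max()                                  # normalize to one
    return np.clip(mask, 0.0, 1.0)                           # valid blend weight
```

The resulting mask can be plugged directly into Eq. 1 as the blending weight $M$.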
2 Example Prediction of FMTNet
In Fig. 2, we show an example prediction of our FMTNet. In addition to visualizing the predicted semantic segmentation (Fig. 2c), we also show the predicted depth map (Fig. 2e) and the update gate (Fig. 2f), which are two of the human interpretable representations computed within our functionally modularized temporal filter. The model is able to predict a meaningful depth map as well as camera motion, which are required to propagate information over time. This is especially visible in the last frame of the sequence – although the last frame is missing, the model is still able to produce a meaningful semantic segmentation. In Fig. 2f, we show the gate of our update module. A white pixel corresponds to a gate value of one, which means that the model uses information provided by the current input frame. A black pixel, on the other hand, corresponds to a gate value of zero – the model relies on prior knowledge of previous frames. As expected, the gate of the first frame is fully white, since the filter has to rely on new information. In the last frame, the gate is mainly black, since no meaningful information is provided in that frame. The gate values at the right border of all frames are more white, as the model has never seen these areas before due to camera motion.