I Introduction
Consider a typical road scene, as shown in Figure 1, while driving a car. We first observe the immediate road and nearby obstacles (pedestrians, cars, cyclists, etc.), followed by adjacent buildings and sky. These scene entities, or objects, can be layered in a typical road scene based on their locations. Understanding such a scene requires us to know the types of the objects and their spatial locations in the 3D world. Most conventional approaches treat this as two different problems: 3D reconstruction and object class segmentation. Recently, both problems have been merged and solved as a single optimization problem. Along this avenue, several challenges exist. Prior segmentation algorithms focus on classifying each pixel individually into different semantic object classes. Such approaches are computationally expensive and may not respect the layered constraint that holds in most road scenes. In this paper, we jointly infer the semantic labels and depths of road scenes using the layered structure of street scenes.
In Figure 1, we use four layers to represent a street view image. The layers are ordered from the bottom of the image and they are associated with semantic classes. The first layer consists of only the ground. The second layer can have dynamic objects: vehicles, pedestrians, and cyclists. The third layer can only have buildings. The fourth layer can only contain sky. Each of these layers is supposed to model planar objects standing upright with respect to the ground at various distances from the camera. The transitions between the layers happen at places where there is depth variation. In most road scenes, four layers are sufficient to model important object classes along with their layered depths. Our approach can fail in challenging scenarios with bridges or tunnels. However, these cases can be discovered by survey vehicles in an offline process. Autonomous vehicles can be alerted using GPS as they encounter these regions, and additional layers can be used to correctly interpret such challenging regions.
We focus on obtaining layer-aware semantic labels and depths jointly from street-view images. We use a stereo camera setup and compute the disparity cost volume for depth cues. For obtaining semantic cues, we use a deep neural network, which extracts deep features from intensity images. The depth and semantic cues are formulated in an energy function that respects the layered street scene constraint. We propose an inference algorithm based on dynamic programming to efficiently minimize this energy function. Our inference algorithm is massively parallelizable. We develop a parallel implementation and achieve a processing speed of 8.8 fps on a GPU. Our method outperforms the competing algorithms that do not enforce the layered constraint. Our inference algorithm is general and can work on data from other modalities, including LIDAR and radar sensors.
I-A Related work
Our work is related to the general area of semantic segmentation and scene understanding, such as [4, 11, 18, 28, 6, 12, 29, 25]. While earlier approaches were based on hand-designed features, it has been shown recently that using deep neural networks for feature learning leads to better performance on this task [8, 13, 23, 27].
The problem of jointly solving both semantic segmentation and depth estimation from a stereo camera was addressed in [20] as a unified energy minimization framework. Our work focuses on semantic labeling using an ordering constraint on road scenes and using fewer classes that are applicable to road scenes. In [16], a typical road scene is classified into ground, vertical objects, and sky to estimate the geometric layout from a single image. Objects like pedestrians and cars are segmented as vertical objects. This would be an under-representation for road scene understanding. [9] modeled the scene using two horizontal curves that divide the image into three regions: top, middle, and bottom.
One popular model for road scenes is the stixel world, which simplifies the world using a ground plane and a set of vertical sticks on the ground representing obstacles [2]. Stixels are a compact and efficient representation for upright objects on the ground. The stixel representation can simply be seen as the computation of two curves. The first curve runs on the ground plane enclosing the free space that can be immediately reached without collision, and the second curve encodes the boundary of the vertical objects. In order to compute the stixel world, either a depth map from the semi-global matching (SGM) stereo algorithm [15] or a cost volume [3] can be used. As with SGM, dynamic programming (DP) enables a fast implementation of the stixel computation. Recently, [30] demonstrated monocular free-space estimation using appearance cues.
Stixmantics [25], a recently introduced model, gives more flexibility than stixels. Instead of having only one stixel for every column, it allows multiple segments along every column in the image and also combines nearby segments to form superpixel-style entities with better geometric meaning. Using these stixel-inspired superpixels, semantic class labeling is addressed.
We focus on obtaining layer-aware semantic labels and depths jointly from street-view images. Our work is closely related to many existing algorithms in vision, most notably tiered scene labeling [9], joint semantic segmentation and depth estimation [20], stixels, and, more recently, stixmantics [25]. Our approach achieves real-time processing speed and outperforms the competing algorithm [20] in accuracy. We also achieve this performance without using explicit depth estimation and temporal constraints, which can be obtained using visual odometry. Similar to the layered street view constraint, Manhattan constraints have been useful in indoor scene understanding [21, 10].
II Layered Street View
Our goal is to jointly estimate semantic labels and depth for each pixel in the street view image using both appearance and depth information. We adopt a layered image interpretation. An image is horizontally divided into four layers of different semantic and depth regions. The layers are ordered from the bottom to the top. In each image column, pixels belonging to the same layer have the same semantic label and depth. The only exception is that the depth of the pixels in the ground layer varies according to their vertical image coordinate, which is determined by the ground plane. The ground plane can be obtained either in an offline external calibration process or in an online estimation process, such as one using the v-disparity map [19]. We enforce a depth-order constraint, i.e., the depth of a lower layer is always smaller than the depth of a higher layer in each image column.
In our four-layer model, the first layer can only have ground. The second layer can have pedestrians or vehicles. The third layer can have only buildings. The fourth layer can only have sky. Note that we do not enforce that each image column has exactly four layers. A column can have any number of layers from one to four. If a layer is absent at a particular image column, the bottom of the layer above it and the top of the layer below it are adjacent. This implies that the curves defining these layers need not be smooth or continuous. The four-layer model provides a flexible method for enforcing geometry and semantics on the scene. The only assumption required is the planar world assumption, which is not restrictive for many applications requiring street scene understanding. If necessary, the geometry of the layered street scene model can be further refined into a dense depth map at additional computational cost. Similarly, the model can be extended with more layers and semantic classes.
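To make the layered interpretation concrete, the following is a minimal sketch of how one image column could be expanded from its layer parameters into per-pixel labels and disparities. The function name `render_column`, the row convention (row 0 at the top of the image), the string labels, and the per-row ground disparity list `ground_disp` are all assumptions made for illustration; they are not taken from the paper.

```python
def render_column(H, h1, h2, h3, l2, d3, ground_disp):
    """Expand one column's layer parameters into per-pixel labels and disparities.

    Assumed conventions (hypothetical, for illustration only):
      - rows run from 0 (image top) to H - 1 (image bottom);
      - h1, h2, h3 are the top rows of the ground, dynamic-object and
        building layers; the sky layer always starts at row 0;
      - ground_disp[y] is the ground-plane disparity at row y;
      - a larger disparity means a smaller depth, and sky has disparity 0.
    """
    assert 0 <= h3 <= h2 <= h1 <= H, "layers must be ordered from top to bottom"
    assert l2 in ("vehicle", "pedestrian")
    # The dynamic object stands upright on the ground at row h1, so it takes
    # the ground-plane disparity of that row.
    d2 = ground_disp[min(h1, H - 1)]
    # Depth order: the building must not be nearer than the dynamic object.
    assert 0 <= d3 <= d2
    labels, disps = [], []
    for y in range(H):
        if y < h3:
            labels.append("sky")
            disps.append(0)
        elif y < h2:
            labels.append("building")
            disps.append(d3)
        elif y < h1:
            labels.append(l2)
            disps.append(d2)
        else:
            labels.append("ground")
            disps.append(ground_disp[y])
    return labels, disps
```

An absent layer is expressed by two equal boundaries; for example, `h2 == h1` yields a column without a dynamic object.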
II-A Problem Formulation
Notations: We use $\mathcal{W}$ and $\mathcal{H}$ to refer to the sets that hold the horizontal and vertical coordinates, respectively. We consider five different semantic object classes, namely ground, vehicle, pedestrian, building, and sky. They are denoted by the symbols $G$, $V$, $P$, $B$, and $S$, respectively. The set of the semantic class labels is denoted by $\mathcal{L} = \{G, V, P, B, S\}$. We use $\mathcal{D}$ for the set of disparity values. The words disparity and depth are used interchangeably in the paper for ease of presentation. It is understood that a one-to-one conversion can be easily obtained by using the parameters in the camera calibration matrix. The cardinalities of the semantic label space and the set of disparity values are denoted by $L$ and $D$, respectively.
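The one-to-one conversion between disparity and depth mentioned above can be sketched with the standard rectified-stereo relation $Z = f B / d$. The parameter names `focal_px` and `baseline_m`, and the values used in the usage note, are hypothetical and are not the calibration of the camera used in the paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity (e.g., sky).
        return float("inf")
    return focal_px * baseline_m / disparity_px


def disparity_from_depth(depth_m, focal_px, baseline_m):
    """Inverse mapping: d = f * B / Z."""
    return focal_px * baseline_m / depth_m
```

For instance, with an assumed focal length of 1000 pixels and an assumed baseline of 0.25 m, a disparity of 10 pixels maps to a depth of 25 m.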
We formulate the layered street view problem as a constrained energy minimization problem. The constraints encode the order of the semantic object class labels and depth values in each column. They limit the solution space of the variables associated with each image column. We solve the constrained energy minimization problem efficiently using an inference algorithm based on dynamic programming.
We use the variables, , , , and , to denote the coordinates of the top pixels of the four layers in the image column . Let , , , and be the semantic object class labels for the four layers and let , , , and be the depths of the four layers in the image column . The ordering constraint and the knowledge of the ground plane allow us to fix some parameters. The actual number of unknowns is only 5 given by . Hence the label assignment for the entire image is given by . The number of possible assignments for an image column is in the order since , , and .
To rank the likelihood of the label assignment, we use evidence from image appearance features and stereo disparity matching features. We aggregate evidence from all the pixels in a column to compute the evidence. Let and be the data terms representing the semantic and depth label cost, respectively, incurred as assigning to the image column . The two terms are summed to yield the data term
(1) 
denoting the cost for assigning to the image column . Instead of working on the standard 2D Markov Random Field space where each pixel can have a depth value and a semantic label as independent variables, we reduce the problem to a constrained energy minimization problem given by
(2)  
s.t.  (3)  
(4)  
(5) 
where the constraint (3) gives the layer structure, the constraint (4) enforces the depth order, and the constraint (5) takes into account the possible semantic labels for each layer. The variable is a function of , the top pixel location of the ground layer, because we assume the dynamic object is standing upright on the ground surface at the th row of the image. The energy function in Equation (2) models the relation of pixels in the same column but not pixels in the same row. We use image patches centering around a pixel as the reception fields for the feature computation at the pixel location. Neighboring pixels have similar reception fields and thus have similar features.
The data term is the cost of assigning label to the column . It is the sum of the pixelwise data terms given by
(6) 
where the per pixel appearance data term is the cost of assigning label to the pixel and the per pixel depth data term is the cost of assigning depth to the pixel . We use a deep neural network for obtaining the per pixel appearance data term and use a standard disparity cost for obtaining the per pixel depth data term detailed in Section III. We summarize our layered street view algorithm in Figure 3.
III Features
We rely on two types of features: depth and appearance.
III-A Depth
We use the smoothed absolute intensity difference for the per-pixel depth data term, which is commonly used in stereo reconstruction algorithms. We first compute the pixel-wise absolute intensity difference for each disparity value in the disparity set $\mathcal{D}$, which renders a cost volume representation. A box filter is then applied to smooth the cost volume. The per-pixel depth data term is given by
$$E_D(x, y, d) = \frac{1}{|\mathcal{P}|} \sum_{(x', y') \in \mathcal{P}(x, y)} \left| I_L(x', y') - I_R(x' - d, y') \right| \tag{7}$$
where $I_L$ and $I_R$ refer to the intensity values of the left and right images, $\mathcal{P}(x, y)$ is an image patch centered at $(x, y)$, and $|\mathcal{P}|$ denotes the cardinality of the patch, serving as a normalization constant. The patch size is fixed to 11-by-11 in our experiments.
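As an illustration of the box-filtered absolute intensity difference described above, here is a minimal pure-Python sketch for a single pixel and a single disparity. The function name `depth_cost`, the convention that a left-image pixel at column `x` matches the right-image pixel at column `x - d`, and the handling of image borders (out-of-range pixels are skipped and the average is taken over the remaining ones) are assumptions, not details given in the paper.

```python
def depth_cost(left, right, x, y, d, radius=5):
    """Box-filtered absolute intensity difference at pixel (x, y) for disparity d.

    left and right are 2D lists indexed as image[row][col]. A radius of 5
    gives the 11-by-11 patch mentioned in the text. Border handling is an
    assumption: pixels without a valid match are skipped.
    """
    H, W = len(left), len(left[0])
    total, count = 0.0, 0
    for yy in range(max(0, y - radius), min(H, y + radius + 1)):
        for xx in range(max(0, x - radius), min(W, x + radius + 1)):
            if xx - d < 0:
                continue  # no corresponding pixel in the right image
            total += abs(left[yy][xx] - right[yy][xx - d])
            count += 1
    # Normalize by the number of pixels that actually contributed.
    return total / count if count else float("inf")
```

Evaluating this function for every pixel and every disparity yields the cost volume; a practical implementation would share the partial sums between neighboring windows rather than recompute each patch.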
III-B Appearance
We compute the per-pixel appearance data term using a deep neural network. Our network, shown in Figure 4, consists of two parts: a multi-scale convolutional neural network (MSCNN) and a recursive context propagation network (RCPN). The MSCNN allows us to extract multi-scale appearance cues, and the RCPN allows us to extract rich contextual information.
MSCNN: Our MSCNN is a close variant of the neural network proposed in [8]. It has 3 convolutional layers. The first convolutional layer has 16 filters of size followed by a rectified linear unit (ReLU) and max-pooling processing. The second layer has 64 filters of size followed by ReLU and max-pooling processing. The third layer has 256 filters of size followed by ReLU. The stacking of the convolutional layers yields a receptive field of 47-by-47 pixels. Due to max-pooling, the resolution of the output feature map is smaller than that of the input image. We upsample the feature map to have the same resolution.
The 3-layer convolutional neural network (CNN) is applied separately to three scales of the Gaussian pyramid of the input image. Specifically, we downsample the input image using three different scales (, , and ) and use the 3-layer CNN to extract features at each scale. We use upsampling to bring all the feature maps to the same resolution. The feature maps from the three scales are concatenated to obtain the MSCNN feature map. Note that the filter weights are constrained to be the same for all the scales. The MSCNN extracts multi-scale appearance information for each pixel, which is then passed to the RCPN. For further details about the MSCNN, please refer to [8].
RCPN: Our RCPN is a variant of the network proposed in [27]. It consists of three subnetworks: the semantic mapper network, the semantic combiner network, and the semantic decombiner network. We use the RCPN to embed rich context information into the output appearance features. The semantic mapper network is a 1-layer CNN with 128 filters of size followed by ReLU. It maps each 768-dimensional feature of the MSCNN feature map to a 128-dimensional semantic feature, which is then fed into the semantic combiner network.
The semantic combiner network has three recursive layers, and each contains 128 filters of size followed by ReLU. The semantic combiner fuses input features in a 4-by-4 region of the input semantic feature map into an output semantic feature. This process is non-overlapping; hence, each semantic combiner layer generates an output feature map that is 16 times smaller than the input one. Applying the 3-layer recursive combiner network renders an output feature map that is 4096 times smaller than the input feature map of the semantic mapper. The semantic combiner network recursively embeds context information from image regions with larger and larger spatial support. The output feature maps from the three layers form a context feature pyramid, which is fed into the semantic decombiner network.
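The size reduction through the combiner layers follows directly from the non-overlapping 4-by-4 fusion: each level divides both dimensions by 4, so three levels shrink the map by a factor of $16^3 = 4096$. A small sketch of this bookkeeping is given below. The function name `context_pyramid_shapes` and the use of ceiling division for sizes that are not multiples of 4 are assumptions; the input size in the usage note is hypothetical as well.

```python
def context_pyramid_shapes(height, width, levels=3, block=4):
    """Spatial size of each context pyramid level under non-overlapping
    block-by-block fusion (ceiling division assumes padded borders)."""
    shapes = []
    h, w = height, width
    for _ in range(levels):
        h = -(-h // block)  # ceiling division
        w = -(-w // block)
        shapes.append((h, w))
    return shapes
```

For a hypothetical 256-by-512 semantic feature map, the three levels have sizes 64-by-128, 16-by-32, and 4-by-8, and the last level holds 4096 times fewer features than the input.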
Similar to the semantic combiner network, the semantic decombiner network has three recursive layers, and each contains 128 filters of size followed by ReLU. It is used to recursively distribute context information residing in the context pyramid back to the individual pixels, from higher to lower levels of the context pyramid. Each decombiner layer fuses the feature map from the previous decombiner layer with that from the corresponding level of the context pyramid. Note that our RCPN implementation differs from [27] in two major ways: 1) we use square patches for context propagation while [27] uses superpixels [22], and 2) we use pyramids to represent the hierarchy of context information while [27] uses superpixel trees. Our design choices allow a more efficient implementation because we do not use superpixel segmentation.
Training: We use grayscale images. The pixel intensity values are scaled to lie between 0 and 1 and centered by subtracting 0.5 before being fed into the deep neural network. To train the network, we connect the output layer to a fully connected layer with 5 neurons, corresponding to the ground, pedestrian, vehicle, building, and sky classes. The fully connected layer is followed by a softmax layer. The network is trained by minimizing the cross-entropy error via stochastic gradient descent with momentum. We use the Caffe library [17] for training. The number of pixels in the semantic classes can be quite different. To avoid the bias from dominant classes (ground and building), we weight the cross-entropy loss based on the semantic class distribution, which yields better performance in practice.
We use the negative logarithm of the softmax scores of the semantic classes as the per-pixel appearance data terms. Let $p(x, y, l)$ be the softmax score of the deep neural network at pixel location $(x, y)$, which represents the probability of assigning the label $l$ to the pixel. The per-pixel appearance data term is given by
$$E_A(x, y, l) = -\lambda \log p(x, y, l) \tag{8}$$
where $\lambda$ is a parameter controlling the relative weight of the appearance and depth data terms.
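The appearance data term described above, a weighted negative logarithm of the softmax score, can be sketched as follows for a single pixel. The function names, the default weight `lam=1.0`, and the small floor `eps` that guards against taking the logarithm of zero are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def appearance_cost(scores, lam=1.0, eps=1e-12):
    """Per-class appearance cost: -lam * log(softmax probability).

    lam is the relative weight between appearance and depth terms;
    eps is a hypothetical floor that avoids log(0).
    """
    return [-lam * math.log(max(p, eps)) for p in softmax(scores)]
```

The class with the highest network score receives the lowest cost, and increasing `lam` makes the appearance cue count more heavily against the depth cue.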
IV Efficient Inference Algorithm
We decompose the energy minimization problem in Equation (2) into subproblems where the th subproblem is given by
(9)  
s.t.  (10)  
(11)  
(12) 
We solve each of the subproblems optimally and combine their solutions to construct the semantic labeling and depth map of the image. For simplicity, we will drop the subscript in the discussion below.
Each of the subproblems can be mapped to a 1D chain labeling problem. The chain has nodes where the first node contains the variables , the second node contains the variables , the third node contains the variables , and the fourth node contain the variables . Utilizing the recursion in the label cost evaluation, a standard dynamic programming algorithm can solve the inference on the 1D chain with a complexity of where the product represents the size of label space at each node and the second comes from the label cost evaluation at each node. Unfortunately, the complexity is too high for realtime applications.
We propose a variant of the dynamic programming algorithm to reduce the complexity of solving the subproblem in (9) to and achieve realtime performance. We first note that some of the variables are known from our street view setup as discussed in the problem formulation section. We only need to search the values for . For any combination of and , we need to find the best combination of and . In the following, we show that precomputing the best combination of and for any and can be achieved in time using recursion.
We first observe that the problem in (9) can then be written as
(13) 
where is an intermediate cost table given by
(14) 
Note that depth of the second layer object is a function of because can be uniquely determined from and the ground plane equation. As a result, depends on both and .
By integrating and along the direction, the sum given by
(15) 
for all combination of and can be computed in time for each . We further note that can be computed via a recursive update rule given by
(16) 
where is an integer satisfying and is used to ensure that the depth ordering constraint between the second and third layers is met. Intuitively, we are computing a running min structure along the decreasing depth of the building layer. The recursive update rule allows us to compute for any and in time. As a result, the complexity of finding the best configuration for a partition can be reduced to where is the time required for searching combinations of and . We perform the 1D labeling algorithm to each image column. The overall complexity of the labeling algorithm is .
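The two ingredients named above, sums along the column that can be retrieved in constant time and a running minimum along the building disparity, can be illustrated with the following sketch for a single column. It is one possible realization written under stated assumptions and is not necessarily the authors' table layout: rows are indexed from the image top, disparities are integer indices with 0 meaning infinitely far (sky), `A[y][l]` and `C[y][d]` are hypothetical per-pixel appearance and depth costs, `gd[y]` is a hypothetical ground-plane disparity index per row, and the depth order is enforced as `d3 <= d2`. A brute-force reference is included so that the two can be compared.

```python
SKY, BUILDING, GROUND, VEHICLE, PEDESTRIAN = range(5)
INF = float("inf")


def prefix(values):
    """Prefix sums: out[k] is the sum of the first k values."""
    out = [0.0]
    for v in values:
        out.append(out[-1] + v)
    return out


def brute_force(A, C, gd):
    """Reference: enumerate every (h1, h2, h3, l2, d3) and sum pixel costs."""
    H = len(A)
    best = INF
    for h1 in range(H + 1):
        d2 = gd[min(h1, H - 1)]  # object disparity from its ground contact row
        for h2 in range(h1 + 1):
            for h3 in range(h2 + 1):
                for l2 in (VEHICLE, PEDESTRIAN):
                    for d3 in range(d2 + 1):  # building not nearer than object
                        e = 0.0
                        for y in range(H):
                            if y < h3:
                                e += A[y][SKY] + C[y][0]
                            elif y < h2:
                                e += A[y][BUILDING] + C[y][d3]
                            elif y < h1:
                                e += A[y][l2] + C[y][d2]
                            else:
                                e += A[y][GROUND] + C[y][gd[y]]
                        best = min(best, e)
    return best


def fast_column(A, C, gd):
    """Same minimum using prefix sums and running minima."""
    H, D = len(A), len(C[0])
    PA = [prefix([A[y][l] for y in range(H)]) for l in range(5)]
    PC = [prefix([C[y][d] for y in range(H)]) for d in range(D)]
    PG = prefix([A[y][GROUND] + C[y][gd[y]] for y in range(H)])
    sky = [PA[SKY][h] + PC[0][h] for h in range(H + 1)]
    # T[h2][d]: best cost of rows [0, h2) split into sky and building,
    # over all h3 <= h2 and all building disparities d3 <= d.
    T = [[0.0] * D for _ in range(H + 1)]
    for d in range(D):
        run = INF  # running min over h3 of (sky cost - building cost up to h3)
        for h2 in range(H + 1):
            run = min(run, sky[h2] - PA[BUILDING][h2] - PC[d][h2])
            m = PA[BUILDING][h2] + PC[d][h2] + run
            # Running min along the disparity axis enforces d3 <= d.
            T[h2][d] = m if d == 0 else min(T[h2][d - 1], m)
    best = INF
    for h1 in range(H + 1):
        d2 = gd[min(h1, H - 1)]
        ground = PG[H] - PG[h1]
        for h2 in range(h1 + 1):
            for l2 in (VEHICLE, PEDESTRIAN):
                obj = PA[l2][h1] - PA[l2][h2] + PC[d2][h1] - PC[d2][h2]
                best = min(best, T[h2][d2] + obj + ground)
    return best
```

In this sketch the table `T` is filled in time proportional to the number of rows times the number of disparities, and the final search visits each pair of boundaries once. Since every column is processed independently, the same routine can be run for all columns in parallel.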
V Implementation
Our algorithm is massively parallelizable and can be implemented using CUDA, a general-purpose parallel computing platform and programming model for NVIDIA GPUs [5]. A GPU comprises a large number of Single-Instruction-Multiple-Data (SIMD) processor cores that allow many threads to execute common operations concurrently on large data arrays. In our implementation of the labeling algorithm, we exploit data-level parallelism in all stages of computation.

Depth data term: We use threads to compute the disparity values for each pixel at . We implemented the box filter using the sliding window approach. First, we use threads to perform 1D sliding window on each row. As the window moves from left to right in the horizontal direction, the new pixel value on the right is added and the existing one on the left is subtracted. We then use threads to carry out the same computation for each column.

Appearance data term: We use the Caffe library [17] to compute the softmax scores output of the deep neural network. We use NVIDIA’s cuDNN library to achieve additional speed up.

Intermediate cost table: We first compute the integral of and over such that the two sum terms can be retrieved in constant time for any range. This can be done in parallel for each column . We observe that for a fixed , computing over each and can be jointly parallelized. Therefore we use threads to compute an intermediate table for and and use threads to find the combinations that yield the minimum cost for each and .

Energy minimization and labeling: Using previously computed , we use threads to search the with the minimum cost in each column in parallel.
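The sliding-window box filter used for the depth data term can be sketched sequentially as follows: the window sum is updated with one addition and one subtraction per step, first along rows and then along columns. This is a plain Python illustration rather than the CUDA implementation; the function names are hypothetical, and the sketch returns unnormalized sums over fully contained windows only, which is an assumption about border handling.

```python
def sliding_window_sum(values, width):
    """Sum of every length-`width` window, updated incrementally."""
    if width > len(values):
        return []
    acc = sum(values[:width])
    out = [acc]
    for right in range(width, len(values)):
        # Add the entering value on the right, drop the leaving one on the left.
        acc += values[right] - values[right - width]
        out.append(acc)
    return out


def box_filter(image, width):
    """Separable box filter: 1D sliding window on rows, then on columns."""
    rows = [sliding_window_sum(r, width) for r in image]
    cols = [sliding_window_sum(list(c), width) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

In a parallel implementation, each row (and then each column) can be handled by its own thread because the passes are independent of one another.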
Memory layout is an important factor for processing speed on a GPU. By default, our image data is stored in row-major form. The GPU implementation naturally takes advantage of this memory layout by assigning threads to work on pixels in the same row, which reside in memory as a contiguous array. This allows the GPU to coalesce the memory accesses of the threads so that the GPU memory bandwidth is efficiently utilized. In addition, our algorithm avoids reshaping or transposing the data in memory, which would take extra memory and time.
We execute our algorithm on a Windows 7 desktop computer equipped with NVIDIA Tesla K40 GPU along with Intel i7 processor. We set the size of onedimension thread blocks to be 64 and the size of twodimension thread blocks to be to facilitate efficient scheduling of the threads on target GPU. To avoid register spills to local memory, we minimize local variable declarations. In our algorithm, no data needs to be shared among threads within a block, therefore shared memory is not used in the implementation.
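Table I: Semantic labeling accuracy (IoU in %) and runtime of the competing algorithms.

| Class / Method | ALE [20] | Stixmantics [25] | Darwin pairwise [14] | PN-RCPN [26] | Appearance only | Proposed |
|---|---|---|---|---|---|---|
| Ground | 94.9 | 93.8 | 95.7 | 96.7 | 96.7 | 96.4 |
| Vehicle | 76.0 | 78.8 | 68.7 | 79.4 | 80.7 | 83.3 |
| Pedestrian | 73.1 | 66.0 | 21.2 | 68.4 | 61.3 | 71.1 |
| Sky | 95.5 | 75.4 | 94.2 | 91.4 | 87.6 | 89.5 |
| Building | 90.6 | 89.2 | 87.6 | 86.3 | 87.4 | 91.2 |
| Avg (all) | 86.0 | 80.6 | 73.5 | 84.5 | 82.8 | 86.3 |
| Avg (dynamic) | 74.5 | 72.4 | 44.9 | 73.8 | 71.0 | 77.2 |
| Runtime per frame (s) | 111 | 0.05 | N/A | 2.8 | 0.07 | 0.11 |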
VI Experiments
Benchmark: We evaluated our approach using the public Daimler Urban Segmentation dataset [24]. The dataset contains 500 stereo grayscale image pairs with pixelwise semantic class annotations for the left images. While the image size in the dataset is 1024x440, only the middle region from to is fully labeled. Hence, the effective image size is 976x360. The dataset is composed for evaluating only the semantic labeling using stereo image pairs. There is no ground truth for depth.
The semantic labels in the annotations include ground, sky, building, pedestrian, vehicle, curbs, bicyclist, motorcyclist, and background clutter. However, only the ground, sky, building, pedestrian, and vehicle classes are considered in the evaluation protocol. The performance metric for the semantic labeling is based on the PASCAL VOC intersection over union (IoU) measure [7], which is the ratio of the cardinality of the intersection of the ground truth and estimated semantic segments to that of their union. Let $\mathcal{P}_{\text{sky}}$ and $\mathcal{G}_{\text{sky}}$ be the sets of pixels labeled as sky in the computed semantic class label map and the ground truth label map, respectively. The IoU measure of the sky class is given by
$$\text{IoU}_{\text{sky}} = \frac{|\mathcal{P}_{\text{sky}} \cap \mathcal{G}_{\text{sky}}|}{|\mathcal{P}_{\text{sky}} \cup \mathcal{G}_{\text{sky}}|} \tag{17}$$
The larger the IoU measure, the better the matching between the ground truth and estimated segments; and, hence, the better the semantic labeling accuracy.
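The IoU measure for one class can be computed as in the following minimal sketch. The function name `iou`, the representation of label maps as 2D lists, and the `ignore` value used to skip unlabeled ground-truth pixels are assumptions made for illustration.

```python
def iou(pred, truth, cls, ignore=None):
    """Intersection over union of class `cls` between two label maps.

    pred and truth are 2D lists of labels with identical shape. Pixels
    whose ground-truth label equals `ignore` are excluded from the count.
    """
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if t == ignore:
                continue  # unlabeled pixel: not part of the evaluation
            if p == cls and t == cls:
                inter += 1
            if p == cls or t == cls:
                union += 1
    return inter / union if union else float("nan")
```

Averaging this value over the evaluated classes gives a mean IoU of the kind reported in the "Avg" rows of the results table.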
Evaluation: We followed the evaluation protocol described in [25], which used the first 300 stereo image pairs in the dataset for training and the remaining 200 stereo image pairs for testing. During testing, we downsampled the input images by half in each dimension. This was necessary for our GPU implementation. During evaluation, we upsampled the image to the original size.
We use the left images of the stereo pairs for training our deep neural network. During testing, the network outputs per-semantic-class softmax scores for each pixel. We compared our approach with several approaches. The competing approaches include the joint-optimal ALE (ALE) algorithm [20], stixmantics [25], Darwin pairwise [14], and PN-RCPN [26]. The algorithms in [20, 26] utilize superpixel segmentations as input, which demands additional computational resources. The stixmantics algorithm [25] uses depth, obtained using an FPGA chip, and temporal constraints from adjacent stereo images. We achieve better accuracy and comparable computational performance without using an FPGA chip or temporal constraints.
In Table I, we compare the semantic labeling accuracy of the competing algorithms. The results of the competing algorithms are taken from the Daimler dataset website [1]. Note that the results on the website are different from those reported in the original paper [25] because the unlabeled pixels were initially not excluded from the IoU computation in the original paper [25].
Performance: In the table, we show that our method achieves the state-of-the-art performance of 86.3%, while ALE achieves an accuracy of 86.0%. The stixmantics algorithm achieves an accuracy of 80.6%, which is significantly lower than that of our method and ALE. In terms of the performance for the dynamic objects (vehicles and pedestrians), the proposed algorithm achieves an accuracy of 77.2%, outperforming ALE, which gets 74.5%. In terms of speed, we are about three orders of magnitude faster than ALE. We generate both depth and semantic labels in 114.1 ms, while ALE takes 111,000 ms. The stixmantics algorithm is faster than our method, requiring only 50 ms. However, it uses an FPGA chip to precompute the depth before estimating the semantic labels, whereas we jointly compute both semantic labels and depth.
In Figure 5, we visualize the semantic labels and depth from the proposed algorithm. Qualitatively, we find that we obtain visually accurate semantic labels and our depth map resembles a piecewise planar approximation of the 3D scene.
We observe that our appearance features alone perform quite well. They achieve an average accuracy of 82.8%. The layered constraint allows us to avoid impossible labelings such as having ground regions in the middle of the sky or having vehicles in the middle of a building. By incorporating this constraint, our performance improves to 86.3%, which corresponds to a 20.3% error reduction. This improvement can be qualitatively seen in a few examples in Figure 6.
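The quoted error reduction can be checked with a short calculation, taking the error as 100% minus the average accuracy. The function name below is hypothetical.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the remaining error removed when accuracy (in %) improves."""
    return (acc_after - acc_before) / (100.0 - acc_before)


# Appearance only (82.8%) versus appearance plus layered constraint (86.3%).
reduction_pct = 100.0 * relative_error_reduction(82.8, 86.3)
```

The error drops from 17.2% to 13.7%, that is, by 3.5 points out of 17.2, which is about 20.3%.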
Table II: Execution time of the individual steps.

| Component name | Execution time in ms |
|---|---|
| Depth data term (cost volume) | 5.2 |
| Appearance data term (DNN) | 70.0 |
| Intermediate table | 25.6 |
| Inference | 13.3 |
| Overall | 114.1 |
We report the execution time required by the individual steps in our algorithm in Table II. All the computation is performed on the GPU. Overall, our algorithm takes 114.1 ms to infer the semantic labels and depth. The run time can be further reduced by about half by utilizing a second GPU card for the neural network computation.
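The overall time and the frame rate follow from the per-component timings reported above, as the short calculation below shows. The dictionary keys are shorthand names chosen for this example.

```python
# Per-component execution times in milliseconds, as reported above.
stage_ms = {
    "depth data term": 5.2,
    "appearance data term": 70.0,
    "intermediate table": 25.6,
    "inference": 13.3,
}

total_ms = sum(stage_ms.values())  # overall time per frame
fps = 1000.0 / total_ms            # frames processed per second
```

The components add up to 114.1 ms per frame, which corresponds to roughly 8.8 frames per second; the neural network accounts for the largest share of the time.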
VII Conclusion
We propose a novel layered street view model and develop an efficient algorithm to jointly estimate semantic labels and depth for street view images. We obtain this result using appearance features, which are computed by a deep neural network, and depth features, which are derived from stereo disparity costs. Our algorithm outperforms the competing methods on the Daimler Urban Segmentation dataset and runs at 8.8 frames per second using a GPU implementation.
Acknowledgement
We thank Jay Thornton, Shumpei Kameyama, reviewers, and area chairs for their feedback and support of this work.
References
 [1] Daimler urban segmentation dataset, 2015.
 [2] H. Badino, U. Franke, and D. Pfeiffer. The stixel world  a compact medium level representation of the 3dworld. In DAGM, 2009.

 [3] R. Benenson, R. Timofte, and L. Van Gool. Stixels estimation without depth map computation. In IEEE International Conference on Computer Vision, 2011.
 [4] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In European Conference on Computer Vision, 2012.
 [5] N. Corporation. NVIDIA CUDA C programming guide. 2014.
 [6] A. Ess, T. Mueller, H. Grabner, and L. Gool. Segmentationbased urban traffic scene understanding. In British Machine Vision Conference, 2009.
 [7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
 [8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, 2013.

 [9] P. F. Felzenszwalb and O. Veksler. Tiered scene labeling with dynamic programming. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3097–3104, 2010.
 [10] A. Flint, D. Murray, and I. Reid. Manhattan scene understanding using monocular, stereo, and 3d features. In IEEE International Conference on Computer Vision.
 [11] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In IEEE International Conference on Computer Vision, 2009.
 [12] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun. 3d traffic scene understanding from movable platforms. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
 [13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.

 [14] S. Gould. Darwin: A framework for machine learning and computer vision research and development. The Journal of Machine Learning Research, 13(1):3533–3537, 2012.
 [15] H. Hirschmuller. Stereo processing by semi-global matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
 [16] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo popup. ACM Transactions on Graphics, 2005.
 [17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 [18] P. Krahenbuhl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Neural Information Processing Systems, 2011.
 [19] R. Labayrade, D. Aubert, and J.P. Tarel. Real time obstacle detection in stereovision on non flat road geometry through "v-disparity" representation. In Intelligent Vehicle Symposium, volume 2, pages 646–651. IEEE, 2002.
 [20] L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. Clocksin, P. H. Torr, et al. Joint optimisation for object class segmentation and dense stereo reconstruction. In British Machine Vision Conference, 2010.
 [21] D. C. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. In IEEE Conference on Computer Vision and Pattern Recognition.
 [22] M.Y. Liu, O. Tuzel, S. Ramalingam, and R. Chellappa. Entropy rate superpixel segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
 [23] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In International Conference on Machine Learning, 2014.
 [24] T. Scharwächter, M. Enzweiler, U. Franke, and S. Roth. Efficient multicue scene segmentation. In Pattern Recognition, pages 435–445. Springer, 2013.
 [25] T. Scharwächter, M. Enzweiler, U. Franke, and S. Roth. Stixmantics: A mediumlevel model for realtime semantic scene understanding. In European Conference on Computer Vision, pages 533–548. Springer, 2014.
 [26] A. Sharma, O. Tuzel, and D. W. Jacobs. Deep hierarchical parsing for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition. 2015.
 [27] A. Sharma, O. Tuzel, and M.Y. Liu. Recursive context propagation network for semantic scene labeling. In Neural Information Processing Systems, pages 2447–2455, 2014.
 [28] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image understanding:multiclass object recognition and segmentation by jointly modeling texture, layout, and context. In International Journal of Computer Vision, 2009.
 [29] C. Wojek, S. Walk, S. Roth, K. Schindler, and B. Schiele. Monocular visual scene understanding: Understanding multiobject traffic scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
 [30] J. Yao, S. Ramalingam, Y. Taguchi, Y. Miki, and R. Urtasun. Estimating drivable collisionfree space from monocular video. In Winter Conference on Applications of Computer Vision, 2015.