1 Introduction
Understanding complex layouts of the 3D world is a crucial ability for applications like robot navigation, driver assistance systems or autonomous driving. Recent success in deep learning-based perception systems enables pixel-accurate semantic segmentation [3, 4, 33] and (monocular) depth estimation [9, 15, 32] in the perspective view of the scene. Other works like [10, 23, 25] go further and reason about occlusions and build better representations for 3D scene understanding. The representation in these works, however, is typically non-parametric, i.e., it provides a semantic label for a 2D/3D point of the scene, which makes higher-level reasoning hard for downstream applications.
In this work, we focus on understanding driving scenarios and propose a rich parameterized model describing complex road layouts in a top-view representation (Fig. 1 and Sec. 3.1). The parameters of our model describe important scene attributes like the number and width of lanes, and the existence and distance to various types of intersections, crosswalks and sidewalks. Explicit modeling of such parameters is beneficial for higher-level modeling and decision making as it provides a tangible interface to the real world. In contrast to prior art [7, 14, 17, 23, 24, 25], our proposed scene model is richer, fully parameterized and can be inferred from a single camera input with a combination of deep neural networks and a graphical model.
However, training deep neural networks requires large amounts of training data. Although annotating the scene attributes of our model for real RGB images is possible, it is costly to do at large scale and, more importantly, extremely difficult for certain scene attributes. While the existence of a crosswalk is a binary attribute and is easy to annotate, annotating the exact width of a side road requires knowledge of the scene geometry, which is hard to obtain from a perspective RGB image alone. We thus propose to leverage simulated data. However, in contrast to rendering photorealistic RGB images, which is a difficult and time-consuming task [20, 21], we propose a scene model that allows for efficient sampling and renders semantic top-view representations that obviate expensive illumination modeling or occlusion reasoning.
Given simulated data with accurate and complete annotations, as well as real images with potentially noisy and incomplete annotations, we propose a hybrid training procedure leveraging both sources of information. Specifically, our neural network design involves a domain-specific feature extractor that tries to bridge the gap between simulated top-views and real semantic ones from [23] (see Fig. 3) with adversarial training and a domain-agnostic classifier of scene parameters. At test time, we convert a perspective RGB image into a semantic top-view representation using [23] and predict our scene model's parameters. Given the individual scene parameter predictions, we further design a graphical model (Sec. 3.4) that captures dependencies among scene attributes in single images and enforces temporal consistency across a sequence of frames. We validate our idea on two public driving data sets, KITTI [8] and NuScenes [18] (Sec. 4). The results demonstrate the effectiveness of the top-view representation, the hybrid training procedure with real and simulated data, and the importance of the graphical model for coherent and consistent outputs. To summarize, our key contributions are:
A novel parametric and interpretable model of complex driving scenes in a top-view representation.

A neural network that (i) predicts the parameters from a single camera and (ii) is designed to enable a hybrid training approach from both real and synthetic data.

A graphical model that ensures coherent and temporally consistent scene description outputs.
2 Related Work
3D scene understanding is an important task in computer vision with many applications for robot navigation [11], self-driving [7, 14], augmented reality [1] or real estate [16, 27].
Outdoor scene understanding
Explicit modeling of the scene is frequently done for indoor applications where strong priors about the layout of rooms can be leveraged [1, 16, 26]. Non-parametric approaches are more common for outdoor scenarios because the layout is typically more complex and harder to capture in a coherent model, with occlusion reasoning often a primary focus. Due to their natural ability to reflect ordering, layered representations [10, 29, 31] have been utilized in scene understanding to reason about geometry and semantics in occluded areas. However, such an intermediate representation is not desired for applications where distance information is required. A top-view representation [23, 25], in contrast, is a more detailed representation for 3D scene understanding. Our work follows the top-view representation and aims to infer a parametric model of complex outdoor driving scenes from a single image input.
A few parametric models have been proposed for outdoor environments too. Seff and Xiao [24] present a neural network that directly predicts scene attributes from a single RGB image. Although those attributes are automatically acquired from OpenStreetMaps [19], they are not rich enough to fully describe complex road scenes, e.g., curved roads with side roads. A richer model that is capable of handling complex intersections with traffic participants is proposed by Geiger et al. [7]. To this end, they propose to utilize multiple modalities such as vehicle tracklets, vanishing points and scene flow. Different from their work, we focus more on scene layouts and propose in Sec. 3.1 a richer model in that aspect, including multiple lanes, crosswalks and sidewalks. Moreover, our base framework is able to infer model parameters with a single perspective image as input. More recent work [14] proposes to infer a graph representation of the road, including lanes and lane markings, from partial segmentations of an image. Unlike our method, which aims to handle complex road scenarios, it focuses only on straight roads. Máttyus et al. [17] propose an interesting parametric model of roads with the goal of augmenting existing map data with richer semantics. Again, this model only handles straight roads and requires input from both perspective and aerial images. Perhaps [23] is the closest work to ours. In contrast to it, we propose a fully-parametric model that is capable of reconstructing complex road layouts.
Learning from simulated data
Besides the scene model itself, one key contribution of our work is the training procedure that leverages simulated data, where we also utilize tools from domain adaptation [6, 30]. While most recent advances in this area focus on bridging domain gaps between synthetic and real RGB images [20, 21], we benefit from the semantic top-view representation within which our model is defined. This representation allows efficient modeling and sampling of a variety of road layouts, while avoiding the difficulty of photorealistic renderings, to significantly reduce the domain gap between simulated and real data.
3 Our Framework
The goal of this work is to extract interpretable attributes of the layout of complex road scenes from a single camera. Sec. 3.1 presents our first contribution, a parameterized and rich model of road scenes describing attributes like the topology of the road, the number of lanes or distances to scene elements. The design of our scene model allows efficient sampling and, consequently, enables the generation of large-scale simulated data with accurate and complete annotations. At the same time, manual annotation of such scene attributes for real images is costly and, more importantly, even infeasible for some attributes, see Sec. 3.2. The second contribution of our work, described in Sec. 3.3, is a deep learning framework that leverages training data from both domains, real and simulated, to infer the parameters of our proposed scene model. Finally, our third contribution is a conditional random field (CRF) that enforces coherence between related parameters of our scene model and encourages temporal smoothness for video inputs, see Sec. 3.4.
3.1 Scene Model
Our model describes road scenes in a semantic top-view representation and we assume the camera to be at the bottom center in every frame. This allows us to position all elements relative to the camera. On a higher level, we differentiate between the "main road", which is where the camera is, and possible "side roads". All roads consist of at least one lane and intersections are a composition of multiple roads. Fig. 2 gives an overview of our proposed model.
Defining two side roads (one on the left and one on the right of the main road) along with distances to each one of them gives us the flexibility to model both 3-way and 4-way intersections. An additional attribute determines if the main road ends after the intersection, which yields T-intersections.
Each road (main or side) is defined by a set of lanes, one- or two-way traffic, delimiters and sidewalks. For the main road, we define up to six lanes on the left and right side of the camera, which occupies the ego-lane. We allow different lane widths to model special lanes like turn or bike lanes. Next to the outermost lanes, optional delimiters of a certain width separate the road from the optional sidewalk. At intersections, we also model the existence of crosswalks at all four potential sides. Our final set of parameters $\Theta$ is grouped into different types and we count $M^b$ binary variables $\Theta_b$, $M^m$ multi-class variables $\Theta_m$ and $M^c$ continuous variables $\Theta_c$. The supplemental material contains a complete list of our model parameters. Note that the ability to work with a simple simulator means we can easily extend our scene model with further parameters and relationships.
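To illustrate how a parameterized model of this kind can double as a simulator (the sampling procedure itself is described in Sec. 3.2), the following minimal sketch draws a handful of attributes by ancestral sampling. It is not the actual model: the class `SceneParams`, its field names, the chosen subset of attributes and all probability distributions are assumptions made up for illustration. Only the limit of six lanes per side is taken from the text.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class SceneParams:
    """Hypothetical subset of scene attributes (not the full model)."""
    num_lanes_left: int                     # multi-class: lanes left of the ego-lane
    num_lanes_right: int                    # multi-class: lanes right of the ego-lane
    lane_width_m: float                     # continuous: lane width in meters
    has_side_road_left: bool                # binary: does a left side road exist?
    dist_side_road_left_m: Optional[float]  # continuous: only set if the side road exists
    main_road_ends: bool                    # binary: main road ends after the intersection


def sample_scene(rng: random.Random) -> SceneParams:
    """Ancestral sampling: draw parent attributes first, then their children."""
    # Root attributes have no parents; the priors below are assumed values.
    num_left = rng.randint(0, 6)
    num_right = rng.randint(0, 6)
    lane_width = rng.uniform(2.5, 4.0)
    has_left = rng.random() < 0.3
    # Child attributes are drawn conditioned on their already-sampled parents.
    dist_left = rng.uniform(5.0, 50.0) if has_left else None
    # In this sketch the main road can only end where a side road exists.
    ends = has_left and rng.random() < 0.2
    return SceneParams(num_left, num_right, lane_width, has_left, dist_left, ends)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_scene(rng))
```

Because children are sampled only after their parents, every sample is feasible by construction: a distance to a side road is never produced for a scene without that side road.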
3.2 Supervision from Real and Simulated Data
Inferring our model’s parameters from an RGB image requires abundant training data. Seff and Xiao [24] leverage OpenStreetMaps [19] to gather ground truth for an RGB image. While this can be done automatically given the GPS coordinates, the set of attributes retrievable is limited and can be noisy. Instead, we leverage a combination of manual annotation and simulation for training.
Real data:
Annotating real images with attributes corresponding to our defined parameters can be done efficiently only when suitable tools are used. This is particularly true for sequential data because many attributes stay constant over a long period of time. The supplemental material contains details on our annotation tool and process. We have collected a data set $\mathcal{D}^r = \{x^r, \Theta^r\}_{i=1}^{N^r}$ of $N^r$ samples of semantic top-views $x^r$ and corresponding scene attributes $\Theta^r$. The semantic top-views $x^r \in \mathbb{R}^{H \times W \times C}$, with spatial dimensions $H \times W$, contain $C$ semantic categories ("road", "sidewalk", "lane boundaries" and "crosswalks") and are computed by applying the framework of [23]. However, several problems arise with real data. First, ground truth depth is required at a reasonable density for each RGB image so that humans can reliably estimate distances to scene elements like intersections or crosswalks. Second, there is always a limit on how much diverse data can be annotated cost-efficiently. Third, and most importantly, not all desired scene attributes are easy or even possible to annotate at large scale, even if depth information is available. For these reasons, we explore simulation as another source of supervision.
Simulated data:
Our proposed scene model defined in Sec. 3.1 can act as a simulator to generate training data with complete and accurate annotation. First, by treating each attribute as a random variable with a certain hand-defined (conditional) probability distribution and relating them in a directed acyclic graph, we can use ancestral sampling [2] to efficiently sample a diverse set of scene parameters $\Theta^s$. Second, we render the scene defined by the parameters $\Theta^s$ into a semantic top-view $x^s$ with the same dimensions as the real top-views $x^r$. It is important to highlight that rendering is easy, compared to photorealistic rendering of perspective RGB images [20, 21], because our model (i) works in the top-view where occlusion reasoning is not required and (ii) is defined in semantic space, making illumination or photorealism obsolete. We generate a data set $\mathcal{D}^s = \{x^s, \Theta^s\}_{i=1}^{N^s}$ of $N^s$ simulated semantic top-views $x^s$ and corresponding $\Theta^s$. Fig. 3 illustrates the difference between real and simulated top-views with a few examples.
3.3 Training and Inferring the Scene Model
We propose a deep learning framework that maps a semantic top-view $x \in \mathbb{R}^{H \times W \times C}$ into the scene model parameters $\Theta$. Figure 4 provides a conceptual illustration. To leverage both sources of supervision (real and simulated data) during training, we define this mapping as
$$\Theta = f(x) = (h \circ g)(x) \tag{1}$$
where $\circ$ defines a function composition and $h$ and $g$ are neural networks, with weights $\gamma_h$ and $\gamma_g$ respectively, that we want to train. The architecture of $g$ is a 6-layer convolutional neural network (CNN) that converts a semantic top-view $x$ into a 1-dimensional feature vector $f_x \in \mathbb{R}^D$. Then, the function $h$ is defined as a multi-layer perceptron (MLP) predicting the scene attributes $\Theta$ given $f_x$. Specifically, $h$ is implemented as a multi-task network with three separate predictions $\eta_b$, $\eta_m$ and $\eta_c$ for each of the parameter groups $\Theta_b$, $\Theta_m$ and $\Theta_c$.
Our objective is that $h \circ g$ works well on real data, while we want to leverage the rich and large set of annotations from simulated data during training. The intuition behind our design is to have a domain-specific encoding that maps semantic top-views of different domains into a common feature representation, usable by a domain-agnostic classifier $h$. To realize this intuition, we define supervised loss functions on both real and simulated data and leverage domain adaptation techniques to minimize the domain gap between the output of $g$ given top-views from different domains.
Loss functions on scene attribute annotation:
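Given data sets $\mathcal{D}^r$ and $\mathcal{D}^s$ of real and simulated data, we define a supervised loss as
$$\mathcal{L}_{\mathrm{sup}} = \lambda^r \, \ell(\mathcal{D}^r) + \lambda^s \, \ell(\mathcal{D}^s) \tag{2}$$
The scalars $\lambda^r$ and $\lambda^s$ weigh the importance between real and simulated data and
$$\ell(\mathcal{D}) = \sum_{i=1}^{|\mathcal{D}|} \mathrm{BCE}(\Theta_{b,i}, \eta_{b,i}) + \mathrm{CE}(\Theta_{m,i}, \eta_{m,i}) + \ell_1(\Theta_{c,i}, \eta_{c,i}) \tag{3}$$
where (B)CE is the (binary) cross-entropy loss and $\{\Theta, \eta\}_{\cdot,i}$ denotes the $i$-th sample in the data set. For regression, we discretize continuous variables into $K$ bins by convolving a Dirac delta function centered at $\Theta_c$ with a Gaussian of fixed variance, which enables easier multi-modal predictions and is useful for the graphical model defined in Sec. 3.4. We ignore scene attributes without manual annotation for the real-data term $\ell(\mathcal{D}^r)$.
Bridging the domain gap: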
Since our goal is to leverage simulated data during the training process, our network design needs to account for the inherent domain gap. We thus define separate feature extraction networks $g^r$ and $g^s$ with shared weights $\gamma_g$ that take as input semantic top-views from either domain, i.e., $x^r$ or $x^s$, and compute respective features $f_{x^r}$ and $f_{x^s}$. We then explicitly encourage a domain-agnostic feature representation by employing an adversarial loss function [6]. We use an MLP $d$ with parameters $\gamma_d$ as discriminator, which takes the feature representations from either domain, i.e., $f_{x^r}$ or $f_{x^s}$, as input and makes a binary prediction into "real" or "fake". As in standard generative adversarial networks, $d$ has the goal to discriminate between the two domains, while the rest of the model aims to confuse the discriminator by providing inputs indistinguishable in the underlying distribution, i.e., a domain-agnostic representation of the semantic top-view maps $x$.
Optimization:
3.4 CRF for Coherent Scene Understanding
We now introduce our graphical model for predicting consistent layouts of road scenes. We first present our CRF for single frames and then extend it to the temporal domain.
Single image CRF:
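Let us first denote the elements of scene attributes and corresponding predictions as $\Theta[\cdot]$ and $\eta[\cdot]$, where we use indices $i \in \{1, \dots, M^b\}$, $p \in \{1, \dots, M^m\}$ and $m \in \{1, \dots, M^c\}$ for binary, multi-class and continuous variables, respectively. We then formulate scene understanding as the energy minimization problem
$$E(\Theta \mid x) = E_b(\Theta_b) + E_m(\Theta_m) + E_c(\Theta_c) + E_s(\Theta_b, \Theta_m) + E_q(\Theta_b, \Theta_c) + E_h(\Theta_b, \Theta_m, \Theta_c) \tag{5}$$
where $E_*$ denotes energy potentials for the associated scene attribute variables ($\Theta_b$, $\Theta_m$ and $\Theta_c$). We will describe the details for each of those potentials in the following.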
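For binary variables $\Theta_b$, our potential function $E_b$ consists of two terms,
$$E_b(\Theta_b) = \sum_{i} \phi_b(\Theta_b[i]) + \sum_{i \neq j} \psi_b(\Theta_b[i], \Theta_b[j]) \tag{6}$$
where $\phi_b(\cdot)$ and $\psi_b(\cdot, \cdot)$ are the unary and pairwise terms. The unary term specifies the cost of assigning a label to $\Theta_b[i]$ and is defined as $-\log P_b(\Theta_b[i])$, where $P_b(\Theta_b[i]) = \eta_b[i]$ is the probabilistic output of our neural network $h$. The pairwise term defines the cost of assigning $\Theta_b[i]$ and $\Theta_b[j]$ to the $i$-th and $j$-th variable as $\psi_b(\Theta_b[i], \Theta_b[j]) = -\log \mathcal{M}_b(\Theta_b[i], \Theta_b[j])$, where $\mathcal{M}_b$ is the co-occurrence matrix and $\mathcal{M}_b(\Theta_b[i], \Theta_b[j])$ is the corresponding probability. For multi-class variables, our potential is defined as $E_m(\Theta_m) = \sum_p \phi_m(\Theta_m[p])$, where $\phi_m(\cdot) = -\log P_m(\cdot)$ and $P_m(\Theta_m[p]) = \eta_m[p]$. Similarly, we define $E_c(\Theta_c) = \sum_m \phi_c(\Theta_c[m])$ with $\phi_c(\Theta_c[m])$ being the negative log-likelihood of $\eta_c[m]$.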
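The interplay of unary and pairwise terms for binary variables can be sketched as follows. This is an illustration under assumptions, not the implementation used in this work: the function names, the dictionary layout of the co-occurrence table and the example numbers are made up, each unordered pair is counted once, and the minimization is done by exhaustive enumeration rather than QPBO, which is only practical for a handful of variables.

```python
import itertools
import math

EPS = 1e-9  # guards against log(0)


def binary_energy(labels, probs, cooc):
    """Energy of one joint assignment of binary variables.

    labels: tuple of 0/1 values, one per variable.
    probs:  probs[i] is the predicted probability that variable i equals 1.
    cooc:   cooc[(i, j)][a][b] is the co-occurrence probability of
            variable i taking value a and variable j taking value b (i < j).
    """
    energy = 0.0
    # Unary terms: negative log-probability of each assigned label.
    for i, a in enumerate(labels):
        p = probs[i] if a == 1 else 1.0 - probs[i]
        energy += -math.log(p + EPS)
    # Pairwise terms: negative log co-occurrence probability of each pair.
    for i, j in itertools.combinations(range(len(labels)), 2):
        energy += -math.log(cooc[(i, j)][labels[i]][labels[j]] + EPS)
    return energy


def minimize_exhaustively(probs, cooc):
    """Return the assignment with the lowest energy by trying all of them."""
    candidates = itertools.product((0, 1), repeat=len(probs))
    return min(candidates, key=lambda lab: binary_energy(lab, probs, cooc))


if __name__ == "__main__":
    # Variable 0: "side road exists"; variable 1: "crosswalk on that side road".
    probs = [0.45, 0.60]
    # A crosswalk without its side road (0, 1) is made very unlikely.
    cooc = {(0, 1): [[0.50, 0.01], [0.20, 0.29]]}
    print(minimize_exhaustively(probs, cooc))
```

In this toy example, thresholding each prediction independently would yield the assignment (0, 1), whereas the pairwise term steers the joint solution away from that rarely co-occurring combination.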
For a coherent prediction, we further introduce the potentials $E_s$, $E_q$ and $E_h$ to model correlations among scene attributes. The potentials $E_s$ and $E_q$ enforce hard constraints between certain binary variables and multi-class or continuous variables to convey the idea that, for instance, the number of lanes of a side road is consistent with the actual existence of that side road. We denote the set of pre-defined pairs between $\Theta_b$ and $\Theta_m$ as $\mathcal{S} = \{(i, p)\}$ and between $\Theta_b$ and $\Theta_c$ as $\mathcal{Q} = \{(i, m)\}$. Potential $E_s$ is then defined as
$$E_s(\Theta_b, \Theta_m) = \sum_{(i, p) \in \mathcal{S}} \infty \times \mathbb{1}\left[\Theta_b[i] \neq \Theta_m[p]\right] \tag{7}$$
where $\mathbb{1}[\cdot]$ is the indicator function. Potential $E_q$ is defined likewise but using the set $\mathcal{Q}$ and variables $\Theta_c$. In both cases, we give a high penalty to scenarios where two types of predictions are inconsistent.
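A hard constraint of this kind can be sketched in a few lines. The consistency rule used here (a side road exists if and only if its lane count is greater than zero) is an assumed reading of the example in the text, and the function name, argument layout and the large finite penalty standing in for an infinite cost are likewise illustrative choices.

```python
def hard_constraint_energy(binary, multiclass, pairs, penalty=1e6):
    """Sum a large penalty over all linked pairs whose values disagree.

    binary:     list of 0/1 existence flags.
    multiclass: list of integer class labels (here: lane counts).
    pairs:      list of (i, p) index pairs linking binary[i] to multiclass[p].
    """
    energy = 0.0
    for i, p in pairs:
        exists = bool(binary[i])
        has_lanes = multiclass[p] > 0
        # Assumed rule: existence and a non-zero lane count must agree.
        if exists != has_lanes:
            energy += penalty
    return energy


if __name__ == "__main__":
    pairs = [(0, 0)]  # left side road flag <-> left side road lane count
    print(hard_constraint_energy([1], [2], pairs))  # consistent prediction
    print(hard_constraint_energy([0], [2], pairs))  # lanes without a side road
```

Adding such a term to the energy makes any inconsistent joint assignment more expensive than every consistent one, as long as the penalty exceeds the largest possible sum of the remaining potentials.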
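Finally, the potential $E_h$ of our energy defined in Eq. (5) models higher-order relations between $\Theta_b$, $\Theta_m$ and $\Theta_c$. The potential takes the form
$$E_h(\Theta_b, \Theta_m, \Theta_c) = \sum_{c \in \mathcal{C}} \infty \times f_c(\Theta_b[i], \Theta_m[p], \Theta_c[m]) \tag{8}$$
where $c = (i, p, m)$ and $f_c(\cdot)$ is a table where conflicting predictions are set to 1. The supplementary material contains a complete definition of the relations between scene attributes and the constraints we enforce on them.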
Temporal CRF:
Given videos as input, we propose to extend our CRF to encourage temporally consistent and meaningful outputs. We extend the energy function from Eq. (5) by two terms that enforce temporal consistency of binary and multi-class variables and smoothness for continuous variables. Due to space limitations, we refer to the supplementary material for details of our formulation.
Learning and inference on CRF:
Since ground truth is not available for all frames, we do not introduce per-potential weights except for , which we set to . However, once weights are introduced for each potential, our graphical model is amenable to piecewise learning [28] or joint learning [5, 34] if ground truth is provided. QPBO [22] is used for inference in both single image and video-based CRFs.
4 Experiments
To evaluate the quality of our scene understanding approach we conduct several experiments and analyze the importance of different aspects of our model. Since we do have manually-annotated ground truth, we can quantify our results and compare with several baselines that demonstrate the impact of two key contributions: the use of top-view maps and simulated data for training. We also put a significant emphasis on qualitative results in this work for two reasons: First, not all attributes of our model are actually contained in the manually-annotated ground truth and can thus not be quantified but only qualitatively verified. Second, there is obviously no prior art showing results on this novel set of ground truth data, which makes the analysis of qualitative results even more important.
Datasets:
Since our focus is on driving scenes and our approach requires semantic segmentation and depth annotation, we choose to work with the KITTI [8] and the newly released NuScenes [18] data sets (at the time of conducting experiments, we only had access to the pre-release of the NuScenes data set). Although both data sets provide laser-scanned data for depth ground truth, note that depth supervision can also come from stereo images [9]. Also, since NuScenes [18] does not provide semantic segmentation, we reuse the segmentation model from KITTI. For both data sets, we manually annotate a subset of the images with our scene attributes. Annotators see the RGB image as well as the depth ground truth and provide labels for 22 attributes of our model. We refer to the supplementary material for details on the annotation process. In total, we acquired around 17000 annotations for KITTI [8] and 3000 annotations for NuScenes [18], which we split into training and testing according to the splits of the perception framework.
Evaluation metrics:
Since the output space of our prediction is complex and consists of a mixture of discrete and continuous variables, which require different handling, we use multiple different metrics for evaluation.
For binary variables (like the existence of side roads) and for multi-class variables (like the number of lanes), we measure accuracy, reported as binary accuracy (Accu.-Bi.) and multi-class accuracy (Accu.-Mul.), respectively. For regression variables we use the mean squared error (MSE).
Besides these standard metrics, we also propose another metric that combines all predicted variables and outputs into a single number. We take the predicted parameters and render the scene accordingly. For the corresponding image, we take the ground truth parameters (augmented with predicted values for variables without ground truth annotation) and render the scene, which assigns each pixel a semantic category. For evaluation, we can now use intersection-over-union (IoU), a standard measure in semantic segmentation. While being a very challenging metric in this setup, it implicitly weighs the attributes by their impact on the area of the top-view. For instance, predicting the number of lanes incorrectly by one has a bigger impact than getting the distance to a side road wrong by one meter.
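The rendering-based metric can be sketched as follows, assuming both scenes have already been rendered into label maps of equal size. The function name, the nested-list representation and the averaging over classes are assumptions for illustration, since the text does not specify how per-class scores are aggregated.

```python
def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between two label maps of equal size.

    pred, gt: nested lists of integer class labels (rows of pixels).
    Classes that appear in neither map are skipped.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            if p == g:
                # Pixel belongs to both the prediction and the ground truth.
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel enlarges the union of both involved classes.
                union[p] += 1
                union[g] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Label 0 = background, label 1 = road; each column is one lane wide.
    gt = [[0, 1, 1, 0],
          [0, 1, 1, 0]]
    pred = [[0, 1, 1, 1],   # one lane too many on the right
            [0, 1, 1, 1]]
    print(mean_iou(pred, gt, num_classes=2))
```

In this toy layout a single extra lane changes a quarter of all pixels, which illustrates why errors in area-dominant attributes such as the lane count weigh heavily in this metric.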
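Table 1: Main results on KITTI [8] and NuScenes [18].

| Method | KITTI [8] Accu.-Bi. | KITTI [8] Accu.-Mul. | KITTI [8] MSE | KITTI [8] IoU | NuScenes [18] Accu.-Bi. | NuScenes [18] Accu.-Mul. | NuScenes [18] MSE | NuScenes [18] IoU |
|---|---|---|---|---|---|---|---|---|
| M-RGB [24] | .811 | .778 | .230 | .317 | .846 | .604 | .080 | .316 |
| M-RGB [24]+D | .799 | .798 | .146 | .342 | .899 | .634 | .021 | .335 |
| M-BEV [23] | .820 | .777 | .141 | .345 | .852 | .601 | .022 | .269 |
| M-BEV [23]+GM | .831 | .792 | .136 | .350 | .852 | .601 | .036 | .338 |
| S-BEV | .694 | .371 | .249 | .239 | .790 | .366 | .162 | .155 |
| S-BEV+DA | .818 | .677 | .222 | .314 | .753 | .568 | .103 | .171 |
| S-BEV+DA+GM | .847 | .683 | .230 | .320 | .723 | .568 | .081 | .160 |
| H-BEV | .816 | .756 | .152 | .342 | .783 | .569 | .039 | .345 |
| H-BEV+DE | .830 | .776 | .158 | .381 | .854 | .626 | .042 | .423 |
| H-BEV+DA | .845 | .792 | .108 | .398 | .856 | .545 | .028 | .346 |
| H-BEV+DA+GM | .849 | .805 | .098 | .371 | .855 | .626 | .033 | .450 |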
4.1 Single Image Evaluation
Our main experiments are conducted with a single image as input. In the next section, we separately evaluate the impact of temporal modeling as described in Sec. 3.4.
Baselines:
Since we propose a scene model of roads with new attributes and corresponding ground truth annotation, there exist no previously reported numbers. We thus choose appropriate baselines that are either variations of our model or relevant prior works extended to our scene model:

Manual-GT-RGB+Depth (M-RGB+D): Same as M-RGB but with the additional task of monocular depth prediction (as in our perception model). The intuition is that this additional supervision aids predicting certain scene attributes, e.g., distances to side roads, and renders a fairer comparison point to our model.

Simulation-BEV (S-BEV): This baseline uses the same architecture as M-BEV but is trained only in simulation.

Simulation-BEV+DomainAdapt (S-BEV+DA): Same as S-BEV, but with additional domain adaptation loss as proposed in our model.
We denote our approach proposed in Sec. 3, according to the nomenclature above, as Hybrid-BEV+DomainAdapt (H-BEV+DA) and further explore two variants of it. First, H-BEV does not employ the discriminator but still trains from both domains. Second, H-BEV+DE also avoids the discriminator but uses a separate set of weights $\gamma_g^r$ and $\gamma_g^s$ for the feature extraction network $g$. The intuition is that the supervised losses from both domains and the separate domain-specific encoding (thus, "+DE") already provide enough capacity and information to the model to find a domain-agnostic representation of the data. Please refer to Fig. 5 for an overview of the different models we compare. For the best models among each group (M, S and H), we report numbers with the graphical model (+GM).
Quantitative results:
Tab. 1 summarizes our main results for both data sets and we can draw several conclusions. First, when comparing the groups of methods by supervision type, i.e., manual (M), simulation (S) and hybrid (H), we can clearly observe the benefit of hybrid methods leveraging both domains. Second, within the group of manual annotation, we can see that adding depth supervision to the approach of [24] significantly improves results, particularly for continuous variables. Predicting scene attributes directly from the top-view representation of [23] is slightly better than M-RGB+D on KITTI and worse on NuScenes, but has the crucial advantage that augmentation with simulated data in the top-view becomes possible, as illustrated with all hybrid variants. Third, within the group of simulated data, using domain adaptation techniques (S-BEV+DA) has a significant benefit. We want to highlight the competitive overall results of S-BEV+DA, which is an unsupervised domain adaptation approach requiring no manual annotation. Fourth, also for hybrid methods, explicitly addressing the domain gap (H-BEV+DE and H-BEV+DA) enables higher accuracy. Finally, all models improve with our graphical model put on top.
Qualitative results:
We show several qualitative results in Fig. 6 and Fig. 7 and again highlight their importance to demonstrate the practicality of our approach qualitatively. We can see from the examples that our model successfully describes a diverse set of road scenes.
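Table 2: Semantic and temporal consistency on KITTI [8] and NuScenes [18] (lower is better).

| Method | KITTI [8] seman. | KITTI [8] temp. | NuScenes [18] seman. | NuScenes [18] temp. |
|---|---|---|---|---|
| S-BEV+DA | 2.82 | 5.32 | 1.08 | 2.09 |
| M-BEV [23] | 2.65 | 3.99 | 1.09 | 1.27 |
| H-BEV+DA | 5.59 | 6.01 | 1.08 | 1.05 |
| H-BEV+DA+GM | 1.77 | 1.93 | 0.11 | 0.42 |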
4.2 Evaluating consistency of our model
We now analyze the impact of the graphical model on the consistency of our predictions, for which we define the following metrics:

Semantic consistency: we measure the conflicts in attribute predictions w.r.t. their semantic meanings. Specifically, we count a conflict if predicted attributes are not feasible in our scene model. The average number of conflicts is reported as our semantic consistency measurement.

Temporal consistency: for each attribute prediction in a video sequence, we measure the number of changes in the prediction. We report the average number of prediction changes as the temporal consistency. The lower the number, the more stable the prediction. Note that consistency alone cannot replace accuracy since a prediction can also be consistently wrong.
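The temporal consistency measure can be sketched as a simple count of label changes between consecutive frames. The function names and the dictionary layout are illustrative assumptions, as is the choice to average over attributes of a single sequence.

```python
def temporal_changes(sequence):
    """Number of frames whose prediction differs from the previous frame."""
    return sum(1 for prev, cur in zip(sequence, sequence[1:]) if prev != cur)


def mean_temporal_consistency(predictions):
    """Average number of prediction changes over all attributes.

    predictions: dict mapping an attribute name to its per-frame predictions.
    """
    counts = [temporal_changes(seq) for seq in predictions.values()]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    predictions = {
        "num_lanes": [2, 2, 3, 2, 2],        # flickers once: two changes
        "has_side_road": [0, 0, 0, 1, 1],    # one genuine change
    }
    print(mean_temporal_consistency(predictions))
```

As noted above, a low count only indicates stability: a prediction that never changes scores zero changes even if it is wrong in every frame.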
As for the temporal consistency, we visualize qualitative results of consecutive frames in two validation sequences from KITTI in Fig. 7. The graphical model successfully enforces temporal smoothness, especially for the number of lanes, the delimiter width and the width of side roads.
Finally, we show in Tab. 2 quantitative results for the consistency metrics defined above on both the KITTI and NuScenes data sets. We compare representative models from each group of different forms of supervision (M, S and H) with the output of the graphical model applied on H-BEV+DA. We can clearly observe a significant improvement in consistency for both data sets. Together with the superior results in Tab. 1, this clearly demonstrates the benefits of the proposed graphical model for our application.
5 Conclusion
In this work, we present a scene understanding framework for complex road scenarios. Our key contributions are: (1) A parameterized and interpretable model of the scene that is defined in the top-view and enables efficient sampling of diverse scenes. The semantic top-view representation makes rendering easy (compared to photorealistic RGB images in perspective view), which enables the generation of large-scale simulated data. (2) A neural network design and corresponding training scheme to leverage both simulated as well as manually-annotated real data. (3) A graphical model that ensures coherent predictions for a single frame input and temporally smooth outputs for a video input. Our proposed hybrid model (using both sources of data) outperforms its counterparts that use only one source of supervision in an empirical evaluation. This confirms the benefits of the top-view representation, enabling simple generation of large-scale simulated data and consequently our hybrid training.
References
 [1] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D Semantic Parsing of Large-Scale Indoor Spaces. In CVPR, 2016.
 [2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
 [3] S. R. Bulò, L. Porzi, and P. Kontschieder. In-Place Activated BatchNorm for Memory-Optimized Training of DNNs. In CVPR, 2018.
 [4] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In ECCV, 2018.
 [5] J. Domke. Learning graphical model parameters with approximate marginal inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(10):2454-2467, 2013.
 [6] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
 [7] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun. 3D Traffic Scene Understanding from Movable Platforms. PAMI, 2014.
 [8] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets Robotics: The KITTI Dataset. International Journal of Robotics Research (IJRR), 2013.
 [9] C. Godard, O. M. Aodha, and G. J. Brostow. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In CVPR, 2017.
 [10] R. Guo and D. Hoiem. Beyond the line of sight: labeling the underlying surfaces. In ECCV, 2012.
 [11] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive Mapping and Planning for Visual Navigation. In CVPR, 2017.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
 [13] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
 [14] L. Kunze, T. Bruls, T. Suleymanov, and P. Newman. Reading between the Lanes: Road Layout Reconstruction from Partially Segmented Scenes. In International Conference on Intelligent Transportation Systems (ITSC), 2018.
 [15] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper Depth Prediction with Fully Convolutional Residual Networks. In 3DV, 2016.
 [16] C. Liu, A. G. Schwing, K. Kundu, R. Urtasun, and S. Fidler. Rent3D: Floor-Plan Priors for Monocular Layout Estimation. In CVPR, 2015.
 [17] G. Máttyus, S. Wang, S. Fidler, and R. Urtasun. HD Maps: Fine-grained Road Segmentation by Parsing Ground and Aerial Images. In CVPR, 2016.
 [18] NuTonomy. The NuScenes data set. https://www.nuscenes.org, 2018.
 [19] OpenStreetMap contributors. Planet dump retrieved from https://planet.osm.org . https://www.openstreetmap.org, 2017.
 [20] S. R. Richter, Z. Hayder, and V. Koltun. Playing for Benchmarks. In ICCV, pages 2232-2241. IEEE, 2017.
 [21] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for Data: Ground Truth from Computer Games. In ECCV, pages 102-118. Springer, 2016.

 [22] C. Rother, V. Kolmogorov, V. Lempitsky, and M. Szummer. Optimizing binary MRFs via extended roof duality. In CVPR, pages 1-8. IEEE, 2007.
 [23] S. Schulter, M. Zhai, N. Jacobs, and M. Chandraker. Learning to Look around Objects for Top-View Representations of Outdoor Scenes. In ECCV, 2018.
 [24] A. Seff and J. Xiao. Learning from Maps: Visual Common Sense for Autonomous Driving. arXiv:1611.08583, 2016.
 [25] S. Sengupta, P. Sturgess, L. Ladický, and P. H. S. Torr. Automatic Dense Visual Semantic Mapping from Street-Level Imagery. In IROS, 2012.
 [26] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic Scene Completion from a Single Depth Image. In CVPR, 2017.
 [27] S. Song, A. Zeng, A. X. Chang, M. Savva, S. Savarese, and T. Funkhouser. Im2Pano3D: Extrapolating 360 Structure and Semantics Beyond the Field of View. In CVPR, 2018.

 [28] C. Sutton and A. McCallum. Piecewise training for undirected models. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 568-575. AUAI Press, 2005.
 [29] J. Tighe, M. Niethammer, and S. Lazebnik. Scene parsing with object instances and occlusion ordering. In CVPR, 2014.
 [30] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to Adapt Structured Output Space for Semantic Segmentation. In CVPR, 2018.
 [31] S. Tulsiani, R. Tucker, and N. Snavely. Layer-structured 3D Scene Inference via View Synthesis. In ECCV, 2018.
 [32] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci. Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation. In CVPR, 2018.
 [33] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. In CVPR, 2017.

 [34] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, pages 1529-1537, 2015.