The Right (Angled) Perspective: Improving the Understanding of Road Scenes using Boosted Inverse Perspective Mapping

by Tom Bruls et al.

Many tasks performed by autonomous vehicles such as road marking detection, object tracking, and path planning are simpler in bird's-eye view. Hence, Inverse Perspective Mapping (IPM) is often applied to remove the perspective effect from a vehicle's front-facing camera and to remap its images into a 2D domain, resulting in a top-down view. However, this leads to unnatural blurring and stretching of objects at further distances, due to the limited resolution of the camera, which restricts applicability. In this paper, we present an adversarial learning approach for generating a significantly improved IPM from a single camera image in real time. The generated bird's-eye-view images contain sharper features (e.g. road markings) and a more homogeneous illumination, while (dynamic) objects are automatically removed from the scene, thus revealing the underlying road layout in an improved fashion. We demonstrate our framework using real-world data from the Oxford RobotCar Dataset and show that scene understanding tasks directly benefit from our boosted IPM approach.




I Introduction

Autonomous vehicles need to perceive and fully understand their environment to accomplish their navigation tasks. Hence, scene understanding is a critical component within their perception pipeline, not only for navigation and planning, but also for safety purposes. While vehicles use different types of sensors to interpret scenes, cameras are one of the most popular sensing modalities in the field, due to their low cost as well as the availability of well-established image processing techniques.

In recent years, deep learning approaches based on images have been very successful and significantly improved the performance of autonomous vehicles in the context of semantic scene understanding 

[1, 2]. Many of these approaches take images from a front-facing camera as their input. However, interpretations (i.e. segmented pixels) in this perspective cannot be related directly to the vehicle’s action space, which is often encoded in a local and/or global coordinate system.

Fig. 1: Boosted Inverse Perspective Mapping (IPM) to improve the understanding of road scenes. Left: Top-down view created by applying a homography-based IPM to the front-facing image (top), leading to unnatural blurring and stretching of objects at further distance. Right: Improved top-down view generated by our Incremental Spatial Transformer GAN, containing sharper features and a homogeneous illumination, while dynamic objects (i.e. the two cyclists) are automatically removed from the scene.

Images of front-facing cameras as well as their interpretations need to be transformed into a different coordinate system (or view) to be effectively utilized within tasks such as lane detection [3, 4], road marking detection [5], road topology detection [6, 7], object detection/tracking [8, 9, 10], as well as path planning and intersection prediction [11, 12]. This transformation is commonly referred to as Inverse Perspective Mapping (IPM). IPM takes the frontal view as input, applies a homography, and produces a top-down view of the scene by mapping the pixels to a different 2D-coordinate frame, which is also known as bird’s-eye view. In practice, this works well in the immediate proximity of the vehicle (assuming the road surface is planar). However, the geometric properties of objects in the distance are affected unnaturally by this non-homogeneous mapping, as shown in Fig. 1. This limits the performance of applications in terms of accuracy and distance at which they can be reliably applied. More crucial, however, is the effect of inaccurate mappings on the semantic interpretation of scenes, where small inaccuracies can lead to significant qualitative differences. As we demonstrate in Section V-B (Table I), these qualitative differences can manifest themselves in many ways, including missing lanes and/or late detection of stop lines (or other critical road markings).
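The homography-based IPM described above can be sketched as a backward warp: for every bird's-eye-view pixel we ask which front-facing-camera pixel it maps back to. The following minimal NumPy sketch illustrates this under simplifying assumptions (nearest-neighbour sampling, and a placeholder homography H; a real system derives H from the camera's intrinsics and its pose relative to the road plane):

```python
import numpy as np

def apply_homography(pts, H):
    """Map Nx2 points through a 3x3 homography (with homogeneous divide)."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def ipm(image, H, out_h, out_w):
    """Backward-warping IPM: for every bird's-eye-view pixel, find the
    source pixel in the front-facing image via H and copy it
    (nearest-neighbour sampling); pixels mapping outside stay black."""
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    bev_pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    src = np.rint(apply_homography(bev_pts, H)).astype(int)
    in_h, in_w = image.shape[:2]
    valid = ((src[:, 0] >= 0) & (src[:, 0] < in_w) &
             (src[:, 1] >= 0) & (src[:, 1] < in_h))
    out = np.zeros((out_h * out_w,) + image.shape[2:], dtype=image.dtype)
    out[valid] = image[src[valid, 1], src[valid, 0]]
    return out.reshape((out_h, out_w) + image.shape[2:])
```

Because H is fixed per pixel, distant road regions are reconstructed from very few camera pixels, which is exactly the source of the blurring and stretching discussed above.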

To overcome these challenges, we present an adversarial learning approach which produces a significantly improved IPM in real time from a single front-facing camera image. This is a difficult problem which is not solved by existing methods, due to the large difference in appearance between the frontal view and IPM. State-of-the-art approaches for cross-domain image translation tasks train (conditional) Generative Adversarial Networks (GANs) to transform images to a new domain [13, 14]. However, these methods are designed to perform aligned appearance transformations and struggle when views drastically change [15]. The latter work, in which a synthetic dataset with perfect ground-truth labels is used to learn IPM, is closest to ours.

We demonstrate in this paper that we are able to generate reliable, improved IPM for larger scenes than in [15], which is therefore able to directly aid scene understanding tasks. We achieve this in real time using real-world data collected under different conditions with a single front-facing camera. Consequently, we must deal with imperfect training labels (see Section IV) created from a sequence of images and ego-motion. An Incremental Spatial Transformer GAN is introduced to address the significant appearance change between the frontal view and IPM. Compared to analytic IPM approaches, our learned model 1) is more realistic, with sharper contours at long distance, 2) is invariant to extreme illumination under different conditions, and 3) removes dynamic objects from the scene to recover the underlying road layout. We make the following contributions in this paper:

  • we introduce an Incremental Spatial Transformer GAN for generating boosted IPM in real time;

  • we explain how to create a dataset for training IPM methods on real-world images under different conditions; and

  • we demonstrate that our boosted IPM approach improves the detection of road markings as well as the semantic interpretation of road scenes in the presence of occlusions and/or extreme illumination.

II Related Work

Improved IPM

As indicated in Section I, many applications in the literature apply IPM. They rely on three assumptions: 1) the camera is in a fixed position with respect to the road, 2) the road surface is planar, and 3) the road surface is free of obstacles. Remarkably, relatively few approaches exist that aim to improve inaccurate IPM when one or more of these assumptions is not satisfied.

Several works have tried to adjust for inaccuracies caused by violations of the first two assumptions. The authors of [16, 17] used vanishing point detection, [18] estimated the slope of the road from the lane markings, and [19] employed motion estimates obtained from SLAM. Violation of the third assumption is tackled in [20] by using a laser scanner to exclude obstacles from the transformation to IPM. Other approaches [21, 22, 23] create a lookup table for all pixels, taking into account the distance of objects on the road surface, in order to reduce artefacts at further distance. However, these methods generally assume simple environments (e.g. highways). In contrast, we learn a non-linear mapping better suited for urban scenes.

Very recently, [15] proposed the first learning approach for IPM using a synthetic dataset. The authors introduced BridgeGAN which employs the homography IPM to bridge the significant appearance gap between the frontal view and bird’s-eye view. In contrast, we use real-world data and consequently imperfect labels to generate boosted IPM for larger scenes. Therefore, our learned mapping is directly beneficial for scene understanding tasks (see Section V-B).

Semantic IPM

Several methods use the semantic relations between the two views for different tasks. In [24, 25] conditional random fields in the frontal view and IPM are optimized to retrieve a coarse semantic bird’s-eye-view map from a sequence of camera images. A joint optimization net is trained in [26, 27] to align the semantic cues of the two views. The authors then train a GAN to synthesize a ground-level panorama from the coarse semantic segmentation. However, because aerial images differ significantly in appearance from the ground view, there is a lack of texture and detail in the synthesized images. We generate a more detailed IPM by learning a direct mapping of the pixels from the frontal view which is more useful for autonomous driving applications.

GANs for Novel View Synthesis

The rise of GANs has made it possible to generate new, realistic images from a learned distribution. To guide the generation process towards a desired output, GANs can be conditioned on an input image [13, 28]. To date, however, these methods have been restricted to performing aligned appearance transformations.

In [29], the spatial transformer module was introduced to learn transformations of the input to improve classification tasks. The authors of [30, 31] used similar ideas to synthesize new views of 3D objects or scenes. More recently, these two fields were combined in [32, 33]. In the latter work, realistic compositions of objects are generated for a new viewpoint. However, these techniques are limited to toy datasets or distort real-world scenes with dynamic objects.

III Boosted IPM using an Incremental Spatial Transformer GAN

III-A Network Overview

As a starting point, we use a state-of-the-art architecture similar to the global enhancer of [28], without employing feature or instance maps. Additionally, as we expect a slight change in scale from the homography-based IPM image to the stitched training labels (see Section IV), we refrain from using any pixel-wise losses and instead use multi-scale discriminator losses [28] combined with a perceptual loss [34, 35] based on VGG16 [36].

Our model follows a largely traditional downsample-bottleneck-upsample architecture, where we reformulate the bottleneck portion of the model as a series of blocks that perform incremental perspective transformations followed by feature enhancement. Each block contains a Spatial Transformer (ST) [29] followed by a ResNet layer [37]. An overview of the architecture is presented in Fig. 2.

Fig. 2: The architecture of the generator with Incremental Spatial ResNet Transformer blocks in the bottleneck.

III-B Spatial ResNet Transformer

Since far-away real-world features are represented by a smaller pixel area as compared to identical close-by features, a direct consequence of applying a full perspective transformation to the input is increased unnatural blurring and stretching of features far in the distance. To counteract this effect, our model divides the full perspective transformation into a series of smaller incremental perspective transformations, each followed by a refinement of the transformed feature space using a ResNet block [37]. The intuition behind this is that the slight blurring that occurs as a result of each perspective transformation is restored by the ResNet block that follows it. To maintain the ability to train our model end-to-end, we apply these incremental transforms using a Spatial Transformer [29].

A Spatial Transformer is an end-to-end differentiable sampler, represented in our case by two major components:

  • a convolutional network which receives an input $U$ of size $H \times W \times C$, where $H$, $W$, and $C$ represent the height, width, and number of channels of the input respectively, and outputs a parametrization $\theta$ of a perspective transformation of size $3 \times 3$; and

  • a Grid Sampler which takes $U$ and $\theta$ as inputs, creates a mapping matrix $M$ of size $H' \times W'$, where $H'$ and $W'$ represent the height and width of the output $V$, and uses $M$ to construct $V$ in the following way: each output pixel $(x^t, y^t)$ is traced back through $\theta$ to source coordinates $(x^s, y^s)$ in $U$, and $V$ is filled by sampling $U$ at those coordinates.

In practice, we decompose $\theta$ into an initial estimate $\theta_0$ and a learned refinement $\Delta\theta$, where $\theta_0$ is initialized with an approximate parametrization of the desired homography, and $\Delta\theta$ is the actual output of the convolutional network and represents a learned perturbation or refinement of $\theta_0$.
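Assuming bilinear interpolation as in the original Spatial Transformer [29], the Grid Sampler can be sketched in NumPy as below. The symbol names and the additive refinement of the initial homography are illustrative assumptions, not taken verbatim from the paper:

```python
import numpy as np

def grid_sample_bilinear(U, theta, out_h, out_w):
    """Grid Sampler: map each output pixel (x_t, y_t) through the 3x3
    perspective parametrisation theta to source coordinates (x_s, y_s),
    then bilinearly interpolate U there.  The bilinear weights are what
    make the sampler differentiable with respect to theta."""
    ys, xs = np.mgrid[0:out_h, 0:out_w].astype(float)
    tgt = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (H', W', 3)
    src = tgt @ theta.T                                   # homogeneous coords
    xs_s = src[..., 0] / src[..., 2]
    ys_s = src[..., 1] / src[..., 2]
    x0 = np.clip(np.floor(xs_s).astype(int), 0, U.shape[1] - 2)
    y0 = np.clip(np.floor(ys_s).astype(int), 0, U.shape[0] - 2)
    wx = np.clip(xs_s - x0, 0.0, 1.0)
    wy = np.clip(ys_s - y0, 0.0, 1.0)
    return ((1 - wy) * (1 - wx) * U[y0, x0] +
            (1 - wy) * wx * U[y0, x0 + 1] +
            wy * (1 - wx) * U[y0 + 1, x0] +
            wy * wx * U[y0 + 1, x0 + 1])

# theta split into an initial homography guess and a learned refinement
# (additive refinement is an assumption for illustration)
theta_0 = np.eye(3)               # approximate homography-based IPM
delta_theta = np.zeros((3, 3))    # would be predicted by the conv net
theta = theta_0 + delta_theta
```

Because every operation is differentiable, gradients flow back through the sampler into the convolutional network that predicts the refinement, which is what allows the incremental blocks to be trained end-to-end.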

Fig. 3: Examples of created training pairs (which show the difficulties of using real-world data) by stitching IPM images generated from future front-facing camera images using the ego-motion obtained from visual odometry. The left example illustrates 1) movement of dynamic objects by the time the images are stitched and 2) stretching of objects because they are assumed to be on the road surface. The right example shows a significant change of illumination conditions. Both show inaccuracies at further lateral distance (e.g. wavy curb) because of sloping road surface and possibly imprecise motion estimation.

III-C Losses

Our architecture stems from [28], but does not make use of any instance feature maps. Due to the potential misalignment between the output of the network and the labels, we rely on a multi-scale discriminator loss and a perceptual loss based on VGG16. With $\mathcal{L}_{GAN}(G, D_k)$ being the traditional GAN loss defined over $K$ scales as in [28], the final objective thus becomes:

$$\min_G \Big( \max_{D_1,\dots,D_K} \sum_{k=1}^{K} \mathcal{L}_{GAN}(G, D_k) + \lambda_{D}\,\mathcal{L}_{D} + \lambda_{P}\,\mathcal{L}_{P} \Big),$$

where $\mathcal{L}_{D}$ is the multi-scale discriminator loss:

$$\mathcal{L}_{D} = \sum_{k=1}^{K} \sum_{i=1}^{T} \lambda_{i} \, \big\| D_k^{(i)}(x) - D_k^{(i)}(G(s)) \big\|_1,$$

and $\mathcal{L}_{P}$ is the perceptual loss:

$$\mathcal{L}_{P} = \sum_{i=1}^{N} \lambda_{i} \, \big\| F^{(i)}(x) - F^{(i)}(G(s)) \big\|_1,$$

with $s$ the input image, $x$ the stitched training label, $D_k^{(i)}$ the $i$-th layer of discriminator $D_k$, $F^{(i)}$ the $i$-th VGG16 layer, $T$ denoting the number of discriminator layers used in the discriminator loss, and $N$ denoting the number of layers from VGG16 that are utilized in the perceptual loss. The weights $\lambda_{i}$ are used to scale the importance of each layer used in the loss.
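As a hedged sketch of how such a weighted multi-layer objective is assembled (the feature lists and weights below are placeholders; in the actual model they would come from the discriminators and from VGG16):

```python
import numpy as np

def weighted_l1(feats_real, feats_fake, weights):
    """Sum of per-layer L1 distances between feature maps, each scaled
    by its layer weight lambda_i."""
    return sum(w * np.mean(np.abs(r - f))
               for w, r, f in zip(weights, feats_real, feats_fake))

def total_loss(gan_loss, disc_real, disc_fake, vgg_real, vgg_fake,
               w_disc, w_vgg):
    """Final objective: adversarial term plus the multi-scale
    discriminator (feature-matching) loss plus the VGG16 perceptual loss."""
    return (gan_loss
            + weighted_l1(disc_real, disc_fake, w_disc)
            + weighted_l1(vgg_real, vgg_fake, w_vgg))
```

Note that both auxiliary terms compare intermediate features rather than raw pixels, which is what makes the objective tolerant to the slight misalignment between network output and stitched labels.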

III-D Implementation Details

For training, we employ the Adam solver with a base learning rate of 0.0002 and a batch size of 1, training for 200 epochs. The numbers of discriminator and VGG16 layers used in the losses, as well as the loss trade-off weights $\lambda$, are set empirically. At run time, the network performs inference in real time on an NVIDIA TITAN X.

IV Creating Training Data for Boosted IPM

To evaluate our approach, we use the Oxford RobotCar Dataset [38], which features a 10-km route through urban environments under different weather and lighting conditions.

In order to create training labels which are a better representation of the real world than the standard, homography-based IPM, we use a sequence of images from the front-facing camera and corresponding visual odometry [39], and merge them into a single bird’s-eye-view image.

Fig. 4: Boosted IPM generated by the network (bottom) under different conditions compared to traditional IPM generated by applying a homography (middle) to the front-facing camera image (top). The boosted bird’s-eye-view images contain sharper features (e.g. road markings), more homogeneous illumination, and automatically remove (dynamic) objects from the scene. Consequently, we infer the underlying road layout, which is directly beneficial for various tasks performed by autonomous vehicles.

From the sensor calibrations and the camera’s intrinsic parameters, we compute the transformation which defines the one-to-one mapping between the pixels of the front-facing camera and the bird’s-eye view. Then, using the relative transform obtained by visual odometry between the current image frame of the sequence and the initial frame, we stitch the respective pixels of the current frame into the IPM image at the correct pixel positions. This operation is performed iteratively, overwriting previous IPM pixels with more accurate pixels of subsequent frames, until the vehicle has reached the end of its field of view of the initial image.
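The iterative stitching above can be sketched as follows, assuming pure in-plane translation between frames expressed in bird's-eye-view pixels (a simplification: the real pipeline uses the full relative transform from visual odometry, including rotation):

```python
import numpy as np

def stitch_labels(bev_frames, poses_px, canvas_shape):
    """Stitch per-frame bird's-eye-view images into one training label.
    poses_px[i] = (dx, dy) is the pixel offset of frame i's origin in the
    initial frame, obtained from ego-motion.  Later frames overwrite
    earlier pixels: as the vehicle advances, accurate close-range pixels
    replace the blurry far-range ones of the initial IPM."""
    canvas = np.zeros(canvas_shape, dtype=bev_frames[0].dtype)
    for frame, (dx, dy) in zip(bev_frames, poses_px):
        h, w = frame.shape[:2]
        canvas[dy:dy + h, dx:dx + w] = frame   # overwrite with newer pixels
    return canvas
```

The overwrite order is the key design choice: because each pixel is eventually filled from the frame in which it was closest to the camera, the stitched label is sharper at range than any single homography-based IPM image.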

As the training labels are created from real-world data (in contrast to the synthetic data of [15]), their quality is limited by several aspects (see examples in Fig. 3):

  • Minor inaccuracies in the estimation of the rotation of the vehicle and sloping road surface can lead to imprecise stitching at further lateral distance.

  • Consecutive image frames may vary significantly in terms of lighting (e.g. due to overexposure), leading to illumination differences in the label which do not naturally occur in the real world.

  • Dynamic objects in the front-facing view will appear in a different position in future frames. Consequently, they will appear in unexpected places in the label.

  • Objects above the road plane (e.g. vehicles, bicyclists, intersection islands, etc.) undergo a large deformation due to the view transformation. We cannot obtain accurate labels for these in real-world scenarios.

Due to the aforementioned drawbacks, no direct pixel-to-pixel correspondence exists between the output (boosted IPM) of our network and the stitched labels. Therefore, it is impossible to incorporate a direct pixel-wise loss function, or to employ super-resolution networks such as [40]. On the other hand, since we use a sequence of future images, regions that were previously occluded by (dynamic) objects in the initial view are potentially revealed later. This gives the network the ability to learn the underlying road layout irrespective of occlusions or extreme illumination. We train our network using 8416 overcast and 4894 nighttime labels.

V Experimental Results

In this section we present qualitative results generated under different conditions. Due to the nature of the problem, it is extremely hard to capture ground-truth labels in the real world (see Section IV), and thus to present quantitative results for our approach. Furthermore, the synthetic dataset used in [15] is not publicly available. Despite this lack of quantitative results, we demonstrate that our boosted IPM has a significant qualitative effect on the semantic interpretation of real-world scenes. Lastly, we show some limitations of the presented framework.

V-a Qualitative Evaluation

Fig. 4 shows qualitative results on a RobotCar test dataset. The results demonstrate that the network has learned the underlying road layout of various urban traffic scenarios. Semantic road features such as parking boxes (i.e. small separators) and stop lines are inferred correctly. Furthermore, dynamic objects, which occlude parts of the scene, are removed and replaced by the correct road/lane boundaries, making the representation more suitable for scene understanding and planning. The boosted IPM contains sharper road markings, which improves the performance of tasks such as lane detection. Lastly, the new view offers a more homogeneous illumination of the road surface, which is beneficial for all tasks that require image processing.

Additionally, we show that our framework is not limited to datasets recorded under overcast conditions. Although artificial lighting introduces colour variations in the output, we are still able to significantly improve the representation of the underlying layout of the scene.

TABLE I: Qualitative Effects of IPM Methods on Road Marking Detection and Scene Interpretation. For each scene, the table compares road marking detection [41] and the resulting scene interpretation [42] (scene graphs generated from the detected road markings) for the standard homography-based IPM and our boosted IPM.

V-B Employing Boosted IPM for Scene Interpretation

We demonstrate the effectiveness of our improved IPM approach for the application of road marking detection [41] and scene interpretation [42] (cf. Table I). Experimentally, we have verified that the proposed IPM method allows us to more robustly detect road markings (1) at greater distance and (2) in more detail, and (3) infer road markings occluded by dynamic objects such as cars and cyclists. These improvements are possible because boosted IPM contains sharper features with more consistent geometric properties (at further distance) and learns the underlying road layout.

We have trained a road marking detection network for each view separately with an equivalent setup according to [41]. The increase in performance for road marking detection has immediate consequences for the interpretation of scenes. In general, all interpretations (scene graphs) benefit from more accurate road marking detection. Table I depicts qualitative differences in scene graphs; in the scene graphs, the qualitative differences resulting from our boosted IPM method are indicated by filled nodes grouped in blue boxes. In the following we discuss the individual scenes.

Scene (A)

The vehicle approaches a pedestrian crossing which is indicated by the upcoming zig-zag lines. While these road markings are visible to the human eye, the trained road marking detection network was not able to detect them from a standard, homography-based IPM, because the shapes are severely deformed. However, our boosted IPM produced a bird’s-eye-view image with sharper contours in the distance and correct reconstruction of the road markings occluded by the vehicle. This resulted in an improved scene graph which not only captured the right boundary of the ego lane, but also a previously undetected second lane on the right.

Scene (B)

The vehicle drives on a road with four lanes — two inner lanes for vehicles and two outer lanes for cyclists — and experiences a change in illumination (from a darker foreground to a brighter background). This is clearly visible in the homography-based IPM and consequently leads to a poor detection of road markings. In contrast, our boosted approach produces a top-down view which inpaints semantic cues (i.e. road markings) directly over the overexposed area and also excludes the two cyclists. Hence, the resulting scene graph captures more detail as well as an extra lane which was missed in the segmentation resulting from the standard approach.


Scene (C)

The vehicle approaches a pedestrian crossing which is indicated by both zig-zag and stop lines. Again, the distorted and blurry image resulting from the homography-based IPM leads to poor detection of road markings. Our boosted approach generates a more detailed view which leads to better road marking detection, including the successful identification of the stop lines. The resulting scene graph based on the homography-based IPM not only misses a lane, but crucially also both stop lines.

Such qualitative differences clearly demonstrate the advantage of our proposed method as they have a direct impact on planning and decision making of autonomous vehicles. While the detection and interpretation of road markings at a greater distance will enable an autonomous vehicle to adapt its behaviour earlier, the detection of road markings behind moving objects will lead to performance that is more robust and safer even when the scene is partly occluded.

V-C Failure Cases

Under certain conditions, the boosted IPM does not accurately depict all details of the bird’s-eye view of the scene.

As we cannot enforce a pixel-wise loss during training (Section IV), the shape of certain road markings is not accurately reflected (illustrated in Fig. 5). Improvement of the representation of these structural elements will be investigated in future work.

Furthermore, the spatial transformer blocks assume that the road surface is more or less planar (and perpendicular to the vertical axis of the vehicle). When this assumption is not satisfied, the network is unable to accurately reflect the top-down scene at further distance. This might be solved by providing or learning the rotation of the road surface with respect to the vehicle.

Fig. 5: Two cases in which the output of the network does not accurately depict the top-down view of the scene. In the left image, the road marking arrow is deformed, because we cannot employ a pixel-wise loss. In the right image, the road surface is not flat (sloping upwards), consequently the spatial transformer blocks map parts of the scene above the horizon, for which the features are not learned.

VI Conclusion

We have presented an adversarial learning approach for generating boosted IPM from a single front-facing camera image in real time. The generated results show sharper features and a more homogeneous illumination, while (dynamic) objects are automatically removed from the scene. Overall, we infer the underlying road layout, which is directly beneficial for tasks performed by autonomous vehicles such as road marking detection, object tracking, and path planning.

In contrast to existing approaches, we used real-world data collected under different conditions, which introduced additional issues due to varying illumination and (dynamic) objects, making it difficult to employ a pixel-wise loss during training. We have addressed the significant appearance change between the views by training an Incremental Spatial Transformer GAN.

We have demonstrated reliable, qualitative results in different environments and under varying lighting conditions. Furthermore, we have shown that the boosted IPM view allows for improved hierarchical scene understanding.

Consequently, our boosted IPM approach can have a significant impact on a wide range of applications in the context of autonomous driving including scene understanding, navigation, and planning.


  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • [2] L. Schneider, M. Cordts, T. Rehfeld, D. Pfeiffer, M. Enzweiler, U. Franke, M. Pollefeys, and S. Roth, “Semantic stixels: Depth is not enough,” in Intelligent Vehicles Symposium (IV), 2016 IEEE.   IEEE, 2016, pp. 110–117.
  • [3] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, “Towards end-to-end lane detection: An instance segmentation approach,” arXiv preprint arXiv:1802.05591, 2018.
  • [4] W. Song, Y. Yang, M. Fu, Y. Li, and M. Wang, “Lane detection and classification for forward collision warning system based on stereo vision,” IEEE Sensors Journal, vol. 18, no. 12, pp. 5151–5163, 2018.
  • [5] B. Mathibela, P. Newman, and I. Posner, “Reading the road: Road marking classification and interpretation,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 4, pp. 2072–2081, 2015.
  • [6] A. L. Ballardini, D. Cattaneo, S. Fontana, and D. G. Sorrenti, “An online probabilistic road intersection detector,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on.   IEEE, 2017, pp. 239–246.
  • [7] S. Schulter, M. Zhai, N. Jacobs, and M. Chandraker, “Learning to look around objects for top-view representations of outdoor scenes,” arXiv preprint arXiv:1803.10870, 2018.
  • [8] J. Dequaire, P. Ondrúška, D. Rao, D. Wang, and I. Posner, “Deep tracking in the wild: End-to-end tracking using recurrent neural networks,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 492–512, 2018.
  • [9] N. Engel, S. Hoermann, P. Henzler, and K. Dietmayer, “Deep object tracking on dynamic occupancy grid maps using RNNs,” arXiv preprint arXiv:1805.08986, 2018.
  • [10] N. Simond and M. Parent, “Obstacle detection from IPM and super-homography,” in Intelligent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on.   IEEE, 2007, pp. 4283–4288.
  • [11] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, “Desire: Distant future prediction in dynamic scenes with interacting agents,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 336–345.
  • [12] A. Zyner, S. Worrall, and E. Nebot, “Naturalistic driver intention and path prediction using recurrent neural networks,” arXiv preprint arXiv:1807.09995, 2018.
  • [13] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2017, pp. 5967–5976.
  • [14] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Computer Vision (ICCV), 2017 IEEE International Conference on.   IEEE, 2017, pp. 2242–2251.
  • [15] X. Zhu, Z. Yin, J. Shi, H. Li, and D. Lin, “Generative adversarial frontal view to bird view synthesis,” arXiv preprint arXiv:1808.00327, 2018.
  • [16] M. Nieto, L. Salgado, F. Jaureguizar, and J. Cabrera, “Stabilization of inverse perspective mapping images based on robust vanishing point estimation,” in Intelligent Vehicles Symposium, 2007 IEEE.   IEEE, 2007, pp. 315–320.
  • [17] D. Zhang, B. Fang, W. Yang, X. Luo, and Y. Tang, “Robust inverse perspective mapping based on vanishing point,” in Security, Pattern Analysis, and Cybernetics (SPAC), 2014 International Conference on.   IEEE, 2014, pp. 458–463.
  • [18] M. Bertozzi, A. Broggi, and A. Fascioli, “An extension to the inverse perspective mapping to handle non-flat roads,” in IEEE International Conference on Intelligent Vehicles. Proceedings of the 1998 IEEE International Conference on Intelligent Vehicles, vol. 1, 1998.
  • [19] J. Jeong and A. Kim, “Adaptive inverse perspective mapping for lane map generation with SLAM,” in Ubiquitous Robots and Ambient Intelligence (URAI), 2016 13th International Conference on.   IEEE, 2016, pp. 38–41.
  • [20] M. Oliveira, V. Santos, and A. D. Sappa, “Multimodal inverse perspective mapping,” Information Fusion, vol. 24, pp. 108–121, 2015.
  • [21] C.-C. Lin and M.-S. Wang, “A vision based top-view transformation model for a vehicle parking assistant,” Sensors, vol. 12, no. 4, pp. 4431–4446, 2012.
  • [22] P. Cerri and P. Grisleri, “Free space detection on highways using time correlation between stabilized sub-pixel precision IPM images,” in Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on.   IEEE, 2005, pp. 2223–2228.
  • [23] J. M. Menéndez García and N. Yaghoobi Ershadi, “A new strategy of detecting traffic information based on traffic camera: Modified inverse perspective mapping,” Journal of Electrical Engineering, Technology and Interface Utilities, vol. 10, no. 2, pp. 1101–1118, 2017.
  • [24] S. Sengupta, P. Sturgess, P. H. Torr et al., “Automatic dense visual semantic mapping from street-level imagery,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on.   IEEE, 2012, pp. 857–862.
  • [25] G. Máttyus, S. Wang, S. Fidler, and R. Urtasun, “HD maps: Fine-grained road segmentation by parsing ground and aerial images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3611–3619.
  • [26] M. Zhai, Z. Bessinger, S. Workman, and N. Jacobs, “Predicting ground-level scene layout from aerial imagery,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on.   IEEE, 2017, pp. 4132–4140.
  • [27] K. Regmi and A. Borji, “Cross-view image synthesis using geometry-guided conditional gans,” arXiv preprint arXiv:1808.05469, 2018.
  • [28] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional GANs,” in Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on.   IEEE, 2018, pp. 1–13.
  • [29] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
  • [30] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision,” in Advances in Neural Information Processing Systems, 2016, pp. 1696–1704.
  • [31] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, “View synthesis by appearance flow,” in European conference on computer vision.   Springer, 2016, pp. 286–301.
  • [32] D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess, “Unsupervised learning of 3D structure from images,” in Advances in Neural Information Processing Systems, 2016, pp. 4996–5004.
  • [33] S. Azadi, D. Pathak, S. Ebrahimi, and T. Darrell, “Compositional GAN: Learning conditional image composition,” arXiv preprint arXiv:1807.07560, 2018.
  • [34] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision.   Springer, 2016, pp. 694–711.
  • [35] A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity metrics based on deep networks,” in Advances in Neural Information Processing Systems, 2016, pp. 658–666.
  • [36] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
  • [38] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 Year, 1000km: The Oxford RobotCar Dataset,” The International Journal of Robotics Research (IJRR), vol. 36, no. 1, pp. 3–15, 2017.
  • [39] W. Churchill, “Experience based navigation: Theory, practice and implementation,” Ph.D. dissertation, University of Oxford, Oxford, United Kingdom, 2012.
  • [40] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on.   IEEE, 2017, pp. 105–114.
  • [41] T. Bruls, W. Maddern, A. A. Morye, and P. Newman, “Mark yourself: Road marking segmentation via weakly-supervised annotations from multimodal data,” in Robotics and Automation (ICRA), 2018 IEEE International Conference on.   IEEE, 2018.
  • [42] L. Kunze, T. Bruls, T. Suleymanov, and P. Newman, “Reading between the lanes: Road layout reconstruction from partially segmented scenes,” in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2018.