In recent years, the development of automated vehicles (AVs) has received substantial attention from both research and industry. One of the key elements of automated driving is the accurate perception of an AV’s environment. It is essential for planning safe and efficient behavior.
Different types of environment representations can be computed, e.g. object lists or occupancy grids. Both require information on the world coordinates of elements in the environment. Among the different types of sensors commonly used to achieve an understanding of the environment, cameras are popular due to low cost and well-established computer vision techniques. Since monocular cameras can only provide information on locations in the image plane, a perspective transformation can be applied to images that results in a top-down or bird’s eye view (BEV). The BEV is an approximation of the same scene as seen from a perspective in which the image plane aligns with the ground plane in front of the camera. The method used for transforming camera images to BEV is commonly referred to as Inverse Perspective Mapping (IPM) [MallotEtAl_InversePerspectiveMapping_1991].
IPM assumes the world to be flat. Any three-dimensional object and changing road elevations violate this assumption. Mapping all pixels to a flat plane thus results in strong visual distortions of such objects. This impedes our goal of accurately locating objects such as other vehicles and vulnerable road users in the vehicle’s environment. For this reason, images transformed through IPM often only serve as input to algorithms for lane detection or free space computation, for which the flat world assumption is often reasonable [BarHillelEtAl_RecentProgressRoad_2014].
Even if errors introduced by IPM could be corrected, we are left with the task of detecting objects in the BEV. Deep learning approaches have proven to be powerful for tasks like semantic segmentation of images but usually require vast amounts of manually labeled data. Simulations can provide BEV images and their corresponding labels but suffer from the so-called reality gap: BEV images computed by a virtual camera in a simulated environment are rather dissimilar to e.g. a drone image captured above a vehicle in the real world, mostly due to unrealistic textures in the simulation. The generalization from a complex task learned in a simulation to the real world has therefore proven to be difficult so far. In order to reduce the reality gap, many approaches thus aim at making simulated data more realistic, e.g. [ZhaoEtAl_MultiSourceDomain_2019].
In this paper, we propose a methodology to obtain BEV images that are not subject to the errors introduced by the flatness assumption underlying IPM. Instead of trying to make simulated images look more realistic, we remove mostly unnecessary texture from real-world data by computing semantically segmented camera images. We show how their use as input to our algorithm allows us to train a neural network on synthetic data only, while still being able to successfully perform the desired task on real-world data. With semantically segmented input, the algorithm has access to class information and is thus able to incorporate it into the correction of images produced by IPM. The output is a semantically segmented BEV of the input scene. Since the object shapes are preserved, the output can be used not only for determining free space but also for locating dynamic objects. In addition, the semantically segmented BEV images contain a color-coding for unknown areas, which are occluded in the original camera image. The image obtained through IPM and the desired ground truth BEV image are displayed in Fig. 1.
The main contributions of this work are as follows:
We propose a methodology capable of transforming the images of multiple vehicle-mounted cameras to semantically segmented images in BEV.
We design and compare two variations of our methodology using different neural network architectures, one of which we specifically design for the task.
We design the process in such a way that no manual labeling of BEV images is required for training our neural network-based models.
We show a successful real-world application of the trained models.
II Related Work
Numerous works of literature address the perspective transformation to BEV. In the automotive context, both [SungEtAl_DevelopmentImageSynthesis_2012] and [ZhangEtAl_SurroundViewCamera_2014] deal with the synthesized transformation of multiple camera images to a top-down surround view. Most works are geometry-based and focus on an accurate depiction of the ground level.
Only a few works combine the transformation to BEV with the task of scene understanding. However, object detection can give clues on an object’s geometry, from which the transformation could benefit. Recently, the deep learning approaches presented below have shown how complex neural networks can aid in improving the classical IPM technique and contribute to environment perception.
The focus of [BrulsEtAl_RightAngledPerspective_2019] and [ZhuEtAl_GenerativeAdversarialFrontal_2019] is to correct the errors introduced by the IPM approach. Dynamic and three-dimensional objects are sought to be removed in the transformed BEV achieved by [BrulsEtAl_RightAngledPerspective_2019] to improve road scene understanding. In contrast, the method proposed in [ZhuEtAl_GenerativeAdversarialFrontal_2019] aims to synthesize an accurate BEV representation of an entire road scene as seen through a front-facing camera, including dynamic objects. Due to the generative nature of the underlying task, both methods employ Generative Adversarial Networks [Schmidhuber_MakingWorldDifferentiable_1990, GoodfellowEtAl_GenerativeAdversarialNets_2014].
Palazzi et al. [PalazziEtAl_LearningMapVehicles_2017] present the prediction of vehicle bounding boxes in BEV from the images of a front-facing camera. Roddick et al. [RoddickEtAl_OrthographicFeatureTransform_2019] demonstrate advanced object detection in computing three-dimensional bounding boxes by using an in-network orthographic feature transform to a three-dimensional discretization of space.
A semantic road understanding in a top-down frame leading to a coarse and static semantic map is achieved in [SenguptaEtAl_AutomaticDenseVisual_2012]. Similar to [BrulsEtAl_RightAngledPerspective_2019], this approach tries to remove dynamic traffic participants.
To the best of our knowledge, the only source pursuing the idea of directly transforming multiple semantically segmented images to BEV is a blog article [Dziubinski_SemanticSegmentationSemantic_2019]. However, it lacks detailed testing and an application to real-world data. The designed neural network is a fully-convolutional autoencoder and has multiple weaknesses, e.g. the range of accurate object detection is relatively low.
We base our methodology on the use of a Convolutional Neural Network (CNN), a class of deep neural networks commonly used for image analysis. Most popular CNNs process only one input image. In order to fuse images from multiple cameras mounted on a vehicle, a single-input network could take as input multiple images concatenated along their channel dimension. However, for the task at hand, this would result in spatial inconsistency between input and output images. Convolutional layers operate locally, i.e. information in particular parts of the input are mapped to approximately the same part of the output. An end-to-end learning approach for the presented problem however needs to be able to handle images from multiple viewpoints. This suggests the need for an additional mechanism.
IPM certainly introduces errors, but the technique is capable of producing an image at least similar to a ground truth BEV image. Due to this similarity, it seems reasonable to incorporate IPM as a mechanism to provide better spatial consistency between input and output images. The image resulting from IPM is also used as an intermediate guiding view in [BrulsEtAl_RightAngledPerspective_2019] and [ZhuEtAl_GenerativeAdversarialFrontal_2019]. In the following, we present two variations of our neural network-based methodology that both include the application of IPM. Before introducing the two neural network architectures, the applied data preprocessing techniques are explained in detail.
III-A Dealing with Occlusions
When only considering the input domain and the desired output for this task, one difficulty immediately becomes apparent: traffic participants and static obstacles may occlude parts of the environment making predictions for those areas in a BEV image mostly impossible. As an example, such occlusions would occur when driving behind a truck: what is happening in front of the truck cannot reliably be determined only from vehicle-mounted camera images.
In order to formulate a well-posed problem, an additional semantic class needs to be introduced for areas in BEV that are occluded in the camera perspectives. This class is introduced to the ground truth label images in a preprocessing step. For each vehicle camera, virtual rays are cast from its mount position to the edges of the semantically segmented ground truth BEV image. The rays are only cast to edge pixels that lie within the specific camera’s field of view. All pixels along these rays are processed to determine their occlusion state according to the following rules:
some semantic classes always block sight (e.g. building, truck);
some semantic classes never block sight (e.g. road);
cars block sight, except on taller objects behind them (e.g. truck, bus);
partially occluded objects remain completely visible;
objects are only labeled as occluded if they are occluded in all camera perspectives.
A ground truth BEV image modified according to these rules is showcased in Fig. 2.
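The ray-casting rules above can be illustrated with a minimal sketch. It is not the implementation used for the dataset: the class ids, the `BLOCK_LEVEL` table, and all function names are hypothetical, the field-of-view restriction is omitted, and the rule that partially occluded objects remain completely visible is only approximated by never marking the first blocking object hit along a ray.

```python
import numpy as np

ROAD, CAR, TRUCK, BUILDING = 0, 1, 2, 3  # hypothetical class ids
# Hypothetical blocking levels: 0 = never blocks sight, 1 = blocks sight
# except on taller objects (cars), 2 = always blocks sight.
BLOCK_LEVEL = {ROAD: 0, CAR: 1, TRUCK: 2, BUILDING: 2}

def edge_pixels(h, w):
    """All border pixels of an h x w image as (row, col) tuples."""
    top = [(0, c) for c in range(w)]
    bottom = [(h - 1, c) for c in range(w)]
    left = [(r, 0) for r in range(1, h - 1)]
    right = [(r, w - 1) for r in range(1, h - 1)]
    return top + bottom + left + right

def ray(start, end):
    """Pixels on the straight line from start to end (simple sampling)."""
    n = max(abs(end[0] - start[0]), abs(end[1] - start[1])) + 1
    rows = np.rint(np.linspace(start[0], end[0], n)).astype(int)
    cols = np.rint(np.linspace(start[1], end[1], n)).astype(int)
    return zip(rows, cols)

def occluded_from(labels, cam):
    """Boolean mask of BEV pixels occluded from one camera position."""
    occ = np.zeros(labels.shape, dtype=bool)
    for end in edge_pixels(*labels.shape):
        level, prev = 0, None
        for r, c in ray(cam, end):
            cls = int(labels[r, c])
            # A blocker only takes effect once the ray has left it.
            if prev is not None and cls != prev:
                level = max(level, BLOCK_LEVEL[prev])
            if level == 2 or (level == 1 and BLOCK_LEVEL[cls] < 2):
                occ[r, c] = True
            prev = cls
    return occ

def occluded_everywhere(labels, cams):
    """A pixel counts as occluded only if it is occluded for all cameras."""
    return np.logical_and.reduce([occluded_from(labels, cam) for cam in cams])
```

In this sketch, the occluded pixels would then be overwritten with the additional semantic class in the ground truth label image.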
III-B Projective Preprocessing
As part of the incorporation of the IPM technique into our methods, the homographies, i.e. the projective transformations between vehicle camera frames and BEV are derived. The determination of the correct homography matrix involves intrinsic and extrinsic camera parameters and shall be briefly described below.
The relationship between homogeneous world coordinates $x_w \in \mathbb{R}^4$ and homogeneous image coordinates $x_i \in \mathbb{R}^3$ is given by the projection matrix $P \in \mathbb{R}^{3 \times 4}$ as
$$x_i = P \, x_w. \tag{1}$$
The projection matrix encodes the camera’s intrinsic parameters (e.g., focal length) in a matrix $K$ and extrinsics (rotation $R$ and translation $t$ w.r.t. the world frame):
$$P = K \, [R \mid t]. \tag{2}$$
Assuming there exists a transformation $M \in \mathbb{R}^{4 \times 3}$ from the road plane $x_r \in \mathbb{R}^3$ to the world frame, s.t.
$$x_w = M \, x_r, \tag{3}$$
we obtain a transformation from image coordinates $x_i$ to the road plane $x_r$:
$$x_r = (P M)^{-1} \, x_i. \tag{4}$$
Note that (1) is generally not invertible, as infinitely many world points correspond to the same image pixel. The assumption of a planar surface, encoded in $M$, makes it possible to construct the invertible matrix $(P M)$.
In order to determine $P$ for real-world cameras, camera calibration methods [KaehlerBradski_LearningOpenCVComputer_2017] can be used.
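As a numerical illustration of this derivation, the following sketch composes a projection matrix from intrinsics and extrinsics, restricts it to the road plane, and inverts the result. The variable names (`K`, `R`, `t`, `P`, `M`, `H`), the choice of the road plane as the world plane z = 0, and the example camera are assumptions made for illustration only.

```python
import numpy as np

def projection_matrix(K, R, t):
    """Compose the 3x4 projection matrix from intrinsics K and extrinsics (R, t)."""
    return K @ np.hstack([R, t.reshape(3, 1)])

# Assumed road-to-world map: homogeneous road coordinates (x, y, 1) are
# lifted to world coordinates (x, y, 0, 1), i.e. the road is the plane z = 0.
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

def image_to_road_homography(P, M=M):
    """Invert the 3x3 matrix P @ M, which maps road coordinates to pixels."""
    return np.linalg.inv(P @ M)

def to_road(H, u, v):
    """Map pixel (u, v) to metric road-plane coordinates."""
    x = H @ np.array([u, v, 1.0])
    return x[:2] / x[2]

# Hypothetical camera 5 m above the world origin, looking straight down.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])
C = np.array([0.0, 0.0, 5.0])   # camera centre in world coordinates
t = -R @ C
P = projection_matrix(K, R, t)
H = image_to_road_homography(P)
```

For this example camera, the road point (1, 2) projects to pixel (70, 10), and `to_road(H, 70.0, 10.0)` recovers it. Points that do not lie on the road plane are mapped to wrong road positions, which is the source of the distortions discussed in the introduction.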
As a preprocessing step to the first variation of our approach (Section III-C), IPM is applied to all images from the vehicle cameras. The transformation is set up to capture the same field of view as the ground truth BEV image. As this area is only covered by the union of all camera images, they are first separately transformed via IPM and then merged into a single image, hereafter called the homography image. Pixels in overlapping areas, i.e. areas visible from two cameras, are chosen arbitrarily from one of the transformed images.
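A minimal sketch of this preprocessing step is given below. It assumes that each camera comes with a 3x3 matrix mapping BEV pixel coordinates to camera pixel coordinates (scaled so that points in front of the camera have a positive third coordinate), that label images use a signed integer type, and that `-1` marks BEV pixels not covered by any camera. The function names and the fill value are hypothetical.

```python
import numpy as np

def warp_to_bev(labels, H_bev_to_img, out_shape, fill=-1):
    """Nearest-neighbour inverse warp of a label image into the BEV frame.

    For every BEV pixel, the source pixel in the camera image is looked up;
    BEV pixels that fall outside the camera image receive `fill`.
    """
    h, w = out_shape
    vs, us = np.mgrid[0:h, 0:w]
    bev = np.stack([us.ravel(), vs.ravel(), np.ones(h * w)])  # 3 x N
    src = H_bev_to_img @ bev
    in_front = src[2] > 1e-9
    z = np.where(in_front, src[2], 1.0)      # avoid division by zero
    u = np.rint(src[0] / z).astype(int)
    v = np.rint(src[1] / z).astype(int)
    inside = (in_front & (u >= 0) & (u < labels.shape[1])
              & (v >= 0) & (v < labels.shape[0]))
    out = np.full(h * w, fill, dtype=labels.dtype)
    out[inside] = labels[v[inside], u[inside]]
    return out.reshape(h, w)

def merge_views(views, fill=-1):
    """Merge per-camera BEV images; in overlaps the first view listed wins."""
    merged = np.full_like(views[0], fill)
    for view in views:
        empty = merged == fill
        merged[empty] = view[empty]
    return merged
```

Nearest-neighbour lookup is chosen here because interpolating between class ids would produce meaningless labels; taking the first view in overlapping areas is one way to realize the arbitrary choice described above.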
III-C Variation 1: Single-Input Model
As the first variation of our approach, we propose to pre-compute the homography image as presented in Section III-B in order to bridge a large part of the gap between camera views and BEV. Hereby we provide, to some extent, spatial consistency between neural network input and output. The network’s task then is to correct the errors introduced by IPM.
To the best of our knowledge, there exist no single-input neural network architectures that specifically target the problem at hand. However, since the homography image and the desired target output image cover the same spatial region, we propose to use existing CNNs for image processing that have proven successful at other tasks such as semantic segmentation.
We choose DeepLabv3+ as the architecture for our proposed single-network-input method. DeepLabv3+ as presented in [ChenEtAl_EncoderDecoderAtrousSeparable_2018] is a state-of-the-art CNN for semantic image segmentation. With MobileNetV2 [SandlerEtAl_MobileNetV2InvertedResiduals_2018] and Xception [Chollet_XceptionDeepLearning_2017], two different network backbones are tested. The resulting neural networks have approximately 2.1M and 41M trainable parameters.
III-D Variation 2: Multi-Input Model
In contrast to the first network architecture presented in Section III-C, we propose a second neural network that processes all non-transformed images from the vehicle cameras as input. It therefore extracts features in the non-transformed camera views and is thus not fully subject to the errors introduced by the IPM. As a way to deal with the problem of spatial inconsistency, we integrate projective transformations into the network.
In order to build an architecture for multiple input and one output image, we propose to extend an existing CNN to multiple input streams with a fusion of said streams inside. Due to its simplicity and thus easy extensibility, we choose the popular semantic segmentation architecture U-Net [RonnebergerEtAl_UNetConvolutionalNetworks_2015] as the basis for the extensions presented in the following.
The base architecture consists of a convolutional encoder and decoder path based on successive pooling and upsampling, respectively. Additionally, high-resolution features from the encoder side are combined with upsampled outputs on the decoder side via skip-connections on each scale. Fig. 3 shows the architecture including the two extensions that are introduced in order to handle multiple input images and add spatial consistency:
The encoder path is separately replicated for each input image. For every scale, features from each input stream are concatenated and convolved to build the skip-connection to the single decoder path.
Before concatenating the input streams, Spatial Transformer [JaderbergEtAl_SpatialTransformerNetworks_2015] units projectively transform the feature maps using the fixed homography as obtained by IPM. These transformers are explained in more detail in Fig. 4.
The neural network is named uNetXST due to its extension to arbitrarily many inputs and the Spatial Transformer units. It contains approximately 9.6M trainable parameters.
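The two extensions can be sketched for a single scale as follows. This is a simplified NumPy illustration under stated assumptions, not the uNetXST implementation: the function names are hypothetical, each homography is assumed to map output (BEV) pixel coordinates to input feature map coordinates with a nonzero third coordinate, and the convolution after concatenation is assumed to have a 1x1 kernel.

```python
import numpy as np

def bilinear_warp(feat, H_out_to_in):
    """Warp a (C, H, W) feature map with a fixed homography.

    Each output pixel is mapped into the input map and sampled bilinearly;
    samples outside the input contribute zero.
    """
    c, h, w = feat.shape
    vs, us = np.mgrid[0:h, 0:w]
    pts = H_out_to_in @ np.stack([us.ravel(), vs.ravel(), np.ones(h * w)])
    u, v = pts[0] / pts[2], pts[1] / pts[2]
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = u - u0, v - v0
    out = np.zeros((c, h * w))
    corners = [(0, 0, (1 - du) * (1 - dv)), (1, 0, du * (1 - dv)),
               (0, 1, (1 - du) * dv), (1, 1, du * dv)]
    for ou, ov, wgt in corners:
        uu, vv = u0 + ou, v0 + ov
        ok = (uu >= 0) & (uu < w) & (vv >= 0) & (vv < h)
        out[:, ok] += wgt[ok] * feat[:, vv[ok], uu[ok]]
    return out.reshape(c, h, w)

def fuse_streams(feats, homographies, weight):
    """Warp each stream, concatenate along channels, apply a 1x1 convolution.

    `weight` has shape (C_out, n_streams * C_in).
    """
    warped = [bilinear_warp(f, H) for f, H in zip(feats, homographies)]
    stacked = np.concatenate(warped, axis=0)
    return np.einsum("oc,chw->ohw", weight, stacked)
```

Because the homographies are fixed, such a transformation adds no trainable parameters; only the convolution weights after the concatenation would be learned.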
IV Experimental Setup
In order to evaluate the methodology presented before, we train the neural networks entirely on simulated data. In the following, we present the synthetic dataset and the training setup.
IV-A Data Acquisition
The data used to train and assess our proposed methodology is created in the simulation environment Virtual Test Drive (VTD) [Neumann-CoselEtAl_VirtualTestDrive_2009]. A recording toolchain allows the generation of potentially arbitrarily many sample images including their corresponding label.
In the simulation, the ego vehicle is equipped with four identical virtual wide-angle cameras covering a full surround view. Ground truth data is provided by a virtual drone camera. The BEV ground truth image is centered above the ego vehicle and has an approximate field of view of .
Both input and ground truth images are recorded at a resolution of . All virtual cameras produce both realistic and semantically segmented images. For semantic segmentation, nine different semantic classes are considered for the visible areas (road, sidewalk, person, car, truck, bus, bike, obstacle, vegetation).
As a trade-off between keeping simulation time low and maximizing data variety, images are recorded at . In total, the dataset contains approximately samples for training and samples for validation, where each sample is a set of multiple input images and one ground truth label. As we only require our method to operate in specified spatial areas, the static elements in the simulated world (i.e. roads, buildings, etc.) remain the same between training and validation data.
In order to later test a real-world application of our methods, a second synthetic dataset is recorded for usage with a single front camera. In this scenario, only three classes are considered for visible areas (road, vehicle, occupied space) and only the area in front of the vehicle is of interest. For this reason, the ground truth images are left-aligned with the ego vehicle. The second dataset contains approximately samples for training and samples for validation.
IV-B Training Setup
To keep training and inference time relatively short, network input images and target labels are center-cropped to an aspect ratio of 2:1 and resized to a resolution of . The input images are converted to a one-hot representation. In order to counter class imbalance in the dataset, the loss function is modified to weigh semantic classes according to the logarithm of their relative occurrence. During training, the Adam optimizer with a learning rate of and parameters and is applied to batches of size .
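The one-hot conversion and a class-weighted loss can be sketched as follows. The exact weighting formula is not given in the text; the form 1 / ln(c + p) with c = 1.02 used here is an assumption borrowed from common practice in semantic segmentation, and all function names are hypothetical.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Convert an (H, W) map of class ids to an (H, W, n_classes) one-hot map."""
    return np.eye(n_classes, dtype=np.float32)[labels]

def log_class_weights(labels, n_classes, c=1.02):
    """Assumed weighting: rarer classes receive larger weights, 1 / ln(c + p)."""
    counts = np.bincount(labels.ravel(), minlength=n_classes)
    freq = counts / counts.sum()
    return 1.0 / np.log(c + freq)

def weighted_cross_entropy(probs, target_one_hot, weights, eps=1e-7):
    """Mean pixel-wise cross-entropy with one weight per semantic class."""
    per_pixel = -np.sum(weights * target_one_hot * np.log(probs + eps), axis=-1)
    return per_pixel.mean()
```

In practice, the class frequencies would be computed once over the whole training set rather than per image.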
IV-C Evaluation Metrics
The Intersection-over-Union (IoU) score is used as the main metric for model performance on the task of predicting a certain semantic class. Class IoU scores are averaged into a single Mean Intersection-over-Union (MIoU) score.
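For reference, the two metrics can be computed as in the following sketch. Whether scores are accumulated per image or over the whole validation set is not stated in the text; the sketch operates on a single pair of label maps and, as an assumption, skips classes that occur in neither prediction nor ground truth. The function names are hypothetical.

```python
import numpy as np

def class_iou(pred, target, n_classes):
    """Per-class IoU; classes absent from both maps are reported as NaN."""
    ious = np.full(n_classes, np.nan)
    for c in range(n_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union > 0:
            ious[c] = np.logical_and(p, t).sum() / union
    return ious

def mean_iou(pred, target, n_classes):
    """Average the class IoU scores, ignoring NaN entries."""
    return np.nanmean(class_iou(pred, target, n_classes))
```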
V Results and Discussion
In this section, we compare the performance of our method variations to each other and discuss the overall improvements of our methodology compared to the classical IPM technique. The standard homography image obtained by IPM is used as the baseline for our evaluation.
We present results for the two single-input models DeepLab Xception and DeepLab MobileNetV2 as well as the multi-input model uNetXST. In order to quantify the benefit of incorporating homographies into our approach, we also present results for alternative model versions without IPM. For the model of our first method variation, DeepLab, this means to simply concatenate the multiple input images along their channel dimension, as explained in the beginning of Section III. For the uNetXST model, this means to ablate the Spatial Transformer units. In the following, these simplified models are denoted by an asterisk (*).
Additionally, we qualitatively test the hypothesis that the proposed methodology can generalize from simulated to real-world data.
V-A Results on Synthetic Data
The performance of our models compared to the baseline is reported in Table I.
The uNetXST model achieves the highest MIoU score on the validation set. This is the case even though uNetXST contains substantially fewer trainable parameters than DeepLab Xception, which is the second best performing network. The result can be seen as evidence for the hypothesis that the approach using uNetXST benefits from being able to extract features from the non-transformed camera images, before perspective errors are introduced by IPM.
The results of the ablation study of omitting IPM from our approach (*) suggest that the erroneous homography view can indeed help to improve performance. Compared to the homography baseline itself, our proposed approach generally achieves a considerably higher performance. The values indicate that both variations of our method can successfully improve the results obtained by IPM for environment perception.
In order to further analyze the performance on a class basis, we present the respective class IoU scores in Table II.
All three proposed networks perform best on the prediction of semantic classes covering large areas, e.g. road and vegetation. Good IoU scores are achieved for cars, trucks, and buses, which are all dynamic traffic participants. All models struggle with the correct prediction and localization of bikes and especially persons. This can be attributed to the fact that both classes represent small objects in BEV and also occur least frequently in the training dataset. The uNetXST results for bikes and persons indicate that the method can indeed benefit from processing the raw, non-transformed camera images. Further measures to counter class imbalance, apart from a weighted loss function, could improve the results on these two classes. The models without IPM (*) consistently perform worse than their counterparts.
A qualitative comparison between our two method variations and the baseline can be made by analyzing the examples depicted in Fig. 5. For both exemplary scenes, we present the input images of the four vehicle-mounted cameras, the ground truth image, the homography image, and predictions from our DeepLab Xception and uNetXST approaches.
The errors introduced by IPM’s flat world assumption are clearly visible in the homography images. Our two models perform well at computing the correct BEV of the scene.
For the first example, moving and parked vehicles are localized particularly well and the predicted object dimensions closely match the ground truth data. The occlusion shadows are reasonably cast and are intercepted by the detection of the two buildings. Note that in contrast to uNetXST, the DeepLab Xception model cannot reliably infer the building dimensions from the homography image. The second example poses another challenging scene at a 4-way intersection with cars, a truck and a motorcycle. Our results show a good localization of traffic participants. The estimation of object dimensions seems worse compared to the first example. However, due to the intersection, the vehicles are slightly rotated relative to each other, which is not the case for most of the training samples. Note that the rightmost car is almost completely occluded and thus not properly detected.
Compared to the homography image, both variations of our approach successfully eliminate errors introduced by IPM. Additionally, they reasonably predict areas in BEV that are occluded from the vehicle camera perspective.
V-B Real-World Application
In order to test our methodology on real-world data, we need a way of obtaining semantically segmented camera images as input to our approach. To this end, we employ an extra CNN for semantic segmentation that achieves a MIoU score of % on an internally labeled testing dataset.
The BEVs of two real-world scenes as computed by our DeepLab Xception and uNetXST models are shown in Fig. 6. Both make reasonable predictions for the location and dimension of other traffic participants, but the uNetXST model produces smoother and qualitatively better results.
In the first example, both networks reasonably predict the positions and dimensions of the parked vehicles on the left and the car ahead. In the second example, all five visible vehicles, even partially occluded ones, are detected by both models. Here, uNetXST generally produces more reasonable object dimensions, especially for the more distant vehicles on the right.
Note that due to vehicle dynamics, in reality, a vehicle camera’s pose relative to the road plane is not constant, as was the case for the simulated data. The fixed IPM transformation used for both models could thus be miscalibrated in the scenes depicted in Fig. 6. Measuring vehicle dynamics and incorporating dynamic transformation changes into the network inference could therefore still improve the results in the real world.
We have proposed a methodology capable of transforming the images of multiple vehicle-mounted cameras to semantically segmented images in bird’s eye view. In the process, errors resulting from the incorrect flatness assumption underlying Inverse Perspective Mapping are removed. The usage of synthetic datasets and an input abstraction to semantically segmented representations of the camera images allows the application to real-world data without manual labeling of BEV images. Additionally, our method is able to accurately predict occluded areas in BEV images. We have designed the neural network uNetXST, which processes multiple inputs and employs in-network transformations. This way the network is able to outperform popular architectures such as DeepLab Xception on the task. All models trained using our approach quantitatively and qualitatively outperform the results obtained by only applying Inverse Perspective Mapping.
Further research is motivated by the potential contribution the presented methodology can make to environment perception via cameras. One promising idea is to incorporate further input such as depth information. Depth information could be computed from stereo cameras, estimated by approaches for monocular camera depth estimation, or obtained from sensors such as LiDAR. Regarding a real-world application, the approach needs to be tested with a 360° multi-camera setup, which will require good semantic segmentation performance not only on front camera images.