Hierarchical Road Topology Learning for Urban Map-less Driving

by Li Zhang, et al.

The majority of current approaches in autonomous driving rely on High-Definition (HD) maps, which detail the road geometry and surrounding area. Yet this reliance is one of the obstacles to mass deployment of autonomous vehicles, because such prior maps scale poorly. In this paper, we tackle the problem of online road map extraction by leveraging the sensory system aboard the vehicle itself. To this end, we design a structured model in which a graph representation of the road network is generated in a hierarchical fashion within a fully convolutional network. The method is able to handle complex road topology and does not require a user in the loop.




1 Introduction

Autonomous vehicles tend to rely on data-hungry perception algorithms to comprehend their surroundings. They carry a constellation of sensors that collect data about the ambient environment, such as the position and dimensions of surrounding objects, weather conditions, and traffic, and they also depend on large, detailed, and accurate global maps.
Maps are an indispensable component of self-driving technology. The unique needs of autonomous vehicles necessitate a new class of HD maps for prior map-based localization, modeling the road surface at centimeter-level accuracy and enabling an autonomous vehicle to confidently deduce its position with respect to the ambient environment. With such strong prior knowledge, an autonomous vehicle is able to enhance its perception and react better to events on the road beyond the reach of on-board sensors, facilitating its interactions with other traffic participants. HD maps generally provide 3D geometric and semantic information on static and physical parts of the world, including lane boundaries, intersections, crosswalks, parking spots, stop signs, and traffic lights. These static maps are computed offline, typically using the sensors of the self-driving vehicle itself, although manual annotations or modifications are often required.

(a) Input grid map
(b) HD map
(c) Ego drivable lanes
(d) Road topology
Figure 1: Example of autonomous vehicle's potential drivable lanes and road topology.

While this paradigm has served to facilitate autonomous driving, such dependence on detailed prior maps is undesirable at global scale, as it requires a wealth of information about the geometry and traffic rules of every single road and, consequently, large volumes of data and storage space. Such maps are not only expensive to store on board autonomous vehicles, but also very laborious to create and maintain [homayounfar2018hierarchical]. Furthermore, it is of extreme importance for the maps to reflect the latest state and components of roads at all times, e.g. repainted road markings, blocked roads, and construction sites. This further compounds the problem and renders this paradigm impractical at scale.

Figure 2: Overview of our road topology learning pipeline. Our model is a multi-stage, multi-task network trained to sequentially create the map components starting from the segmented lanes up to the complete road topology.

To enhance the practicality and scalability of a self-driving vehicle, we herein propose a solution to transition from heavily relying on HD maps to adopting a methodology with online mapping features. This solution, thereby, eliminates the need for the creation, manipulation, and maintenance of highly accurate maps. The system, as a result, is rendered more agile by adapting to road conditions and does not require precise localization. Figure 2 depicts an overview of our proposed method.
We devise a learning methodology where the road condition and its components are learned in their current state, independent of the road complexity and number of lanes. The model receives snapshots of the vehicle's surroundings reflecting the instantaneous environmental condition as well as the road structure and obstructions. It then predicts the road topology in a hierarchical fashion: it detects the ego vehicle's drivable lanes at the low level and, subsequently, connects this information into a global topological map for robust navigation. In the context of maps, the method produces a road network map, i.e., a graph where edges are polylines corresponding to road segments and vertices represent the spatial coordinates of the start, end, and fork points of each lane segment. This map varies dynamically with the ego vehicle's position and direction so that it contains only the information relevant to the vehicle's planning (Figure 1).
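As an illustration of the graph described above, the following is a minimal sketch of a container with vertices for start, end, and fork points and polyline edges. The class name `RoadGraph` and its methods are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class RoadGraph:
    """Vertices are (x, y) start/end/fork points; edges are polylines."""
    vertices: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)  # (i, j) -> polyline from i to j

    def add_vertex(self, x, y):
        self.vertices.append((x, y))
        return len(self.vertices) - 1

    def add_edge(self, i, j, polyline):
        # The polyline is a list of (x, y) points running from vertex i to j.
        self.edges[(i, j)] = list(polyline)

    def successors(self, i):
        return [j for (a, j) in self.edges if a == i]

    def forks(self):
        # A fork is a vertex with more than one outgoing lane segment.
        return [i for i in range(len(self.vertices))
                if len(self.successors(i)) > 1]


# A straight lane that splits into two branches (illustrative coordinates).
graph = RoadGraph()
start = graph.add_vertex(0, 0)
fork = graph.add_vertex(0, 10)
left = graph.add_vertex(-3, 20)
right = graph.add_vertex(3, 20)
graph.add_edge(start, fork, [(0, 0), (0, 10)])
graph.add_edge(fork, left, [(0, 10), (-1, 15), (-3, 20)])
graph.add_edge(fork, right, [(0, 10), (1, 15), (3, 20)])
```

Storing edges as polylines keyed by vertex pairs keeps connectivity and geometry together, which is what a planner would need to consume such a map.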

2 Related Work

To reduce the dependency on maps, several techniques have been developed focusing mainly on predicting drivable routes, which can then be used for generating path proposals. Traditional methods establish connectivity by incorporating contextual priors such as color and texture information [kong2010general, chiu2005lane] or road geometry [dickmanns1992recursive, kuhnl2012spatial]. Some approaches leverage the environment structure and rely on distinct features, such as lane markings and curbs [li2016deep, topfer2014efficient, neven2018towards, beck2014non, lee2018development]. To extend these systems to complex urban environments and rural or undeveloped areas in the absence of clear or consistent lane markings, a class of approaches cast the problem as semantic segmentation to capture large spatial context [suleymanov2018inferring, liang2019convolutional, lee2017vpgnet, he2016accurate, barnes2017find], or estimate the lane geometry as well as the semantics of each lane [meyer2018deep, homayounfar2018hierarchical].

To enable navigation and driving in the absence of detailed maps, based on a comprehensive understanding of the immediate environment while following simple higher-level directions, some approaches include rough map priors as a baseline. In [ort2019maplite], a map-less driving framework is proposed that combines topometric maps with a LiDAR-based perception system for local navigation. The global topological localization and the corresponding graph search are performed based on OpenStreetMap (OSM) data. A decision-making architecture is proposed in [artunedo2019decision] that obtains a global route from OSM and generates driving corridors, which are then adapted and bounded using a vision-based lane detection algorithm and a probabilistic grid-based corridor reduction. The drawback of such solutions, however, is that they cannot reason about roads absent in the initial coarse map.

A class of methods focuses on road mapping from aerial images, in which pixel-level segmentation is typically combined with graph-based optimization [mnih2010learning, mattyus2017deeproadmapper, marmanis2016semantic]. Such approaches, however, usually provide only local information about the presence of roads. Detailed information, including the inter-connectivity of road segments, is recovered later through an error-prone post-processing stage. To eliminate such an intermediate representation, some methods expand the road tree based on certain footprints or produce the road network directly from a CNN [bastani2018roadtracer]. Although such road topologies are very useful for routing purposes, in the context of autonomous driving they do not provide the level of detail and accuracy required for safe localization and planning.

In the domain of offline mapping, instead of modeling the geometry of each lane, [homayounfar2019dagmapper] proposes to parameterize the unknown lane graph as a Directed Acyclic Graphical model (DAG) and predicts structured outputs such as a polyline. In such approaches, the data is usually collected by driving each path multiple times, and hence provides much denser information compared to online approaches. An online lane detection method is proposed in [homayounfar2018hierarchical], which predicts a structured representation of lane boundaries in the form of polylines. The method exploits a hierarchical Recurrent Neural Network (RNN) to extract them from a top-down LiDAR point cloud, where one RNN decides on adding new lanes, while the second RNN predicts the vertices along the lane. The experiments, however, focus mainly on simple topologies in highway scenarios.

3 Hierarchical Road Topology Learning

To facilitate map-less autonomous driving, we propose a hierarchical map-learning methodology that does not depend on HD maps and represents the road topology purely based on the sensory system aboard the vehicle. In this methodology, a road topology is defined as a set of keypoints and their relative connectivity, where each connection represents a lane segment. As demonstrated in Figure 2, our model is a multi-stage, multi-task network trained to sequentially create the map components, starting from the drivable lanes up to the complete road topology.

3.1 Input Parameterization

We encode the environment around the ego vehicle as a rasterized bird's eye view image that contains multiple channels of information: occupancy, ground semantics, ground markings from camera, and ground intensity from LiDAR. The occupancy channel combines measurements from LiDAR and camera that have been semantically classified as obstacle, using the Dempster-Shafer rule of combination [Dempster1968]. The ground semantics channel accumulates information about the location of drivable road, sidewalk, and terrain. The semantic information used to differentiate ground from obstacle measurements is extracted using camera and LiDAR data in a prior step, following [HernandezJuarez2019SlantedSA] and [piewak2018improved]. The model of [HernandezJuarez2019SlantedSA] is further extended to also infer road markings in the camera image, which are accumulated in the ground markings channel. For all channels, measurements are accumulated not only across sensors, but also over time (using ego-motion correction), to obtain a single holistic and temporally stable input representation. The result is a single multi-channel image. In our experiments, we use an encoding where the vehicle is always positioned at the bottom 1/4 of the image. Figure 3 shows an example of the channels employed in our proposed approach.
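To make the occupancy fusion step concrete, here is a minimal sketch of Dempster's rule of combination for a single grid cell. The frame of discernment (occupied, free, unknown), the function name `dempster_combine`, and the example masses are assumptions made for illustration; the paper only states that the Dempster-Shafer rule is used.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions over {occupied, free, unknown}.

    Each argument is a tuple (m_occupied, m_free, m_unknown) that sums to 1,
    where "unknown" is the mass assigned to the whole frame.
    """
    o1, f1, u1 = m1
    o2, f2, u2 = m2
    # Conflict: one source says occupied while the other says free.
    conflict = o1 * f2 + f1 * o2
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    occupied = (o1 * o2 + o1 * u2 + u1 * o2) / norm
    free = (f1 * f2 + f1 * u2 + u1 * f2) / norm
    unknown = (u1 * u2) / norm
    return occupied, free, unknown


# Hypothetical per-cell evidence from two sensors.
lidar_cell = (0.6, 0.1, 0.3)
camera_cell = (0.5, 0.2, 0.3)
fused = dempster_combine(lidar_cell, camera_cell)
```

When both sensors lean towards "occupied", the fused occupied mass exceeds either input, which is the behavior one would want when accumulating obstacle evidence across sensors.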

(a) Occupancy
(b) Ground markings
(c) LiDAR intensity
(d) Ground semantics
Figure 3: Input image channels. For 3(a), 3(b), and 3(c), higher brightness corresponds to higher existence probability. The data in 3(d) is color-coded to differentiate drivable road, sidewalk, and terrain.

3.2 Scope Definition

Following the intuitive understanding of a human driver's horizon, the definition of road topology is limited to the ego vehicle's perception range. This definition applies to all lane segments that are topologically connected to the lane segment the ego vehicle is currently in and that are reachable by moving forward, right, or left. The topology gets updated as the vehicle moves and receives new measurements. To simulate such behavior in the training data derived from HD maps, an orientation constraint is applied where each lane segment should also form an acute angle with the ego vehicle's yaw. Figure 1 depicts the resulting potential drivable space in green, overlaid on all the possible routes extracted from the HD map, which are depicted in blue. The corresponding road topology defined for the ego vehicle's potential drivable lanes is displayed in Figure 1(d).
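The orientation constraint can be sketched as a simple filter. The function names and the example headings below are hypothetical, and the sketch assumes each lane segment is summarized by a single heading angle in radians.

```python
import math


def is_in_scope(segment_heading, ego_yaw):
    """True if the segment heading forms an acute angle with the ego yaw."""
    # cos > 0 is equivalent to an angle difference below 90 degrees,
    # and it handles wrap-around (e.g. 350 deg vs. 10 deg) for free.
    return math.cos(segment_heading - ego_yaw) > 0.0


def filter_segments(segments, ego_yaw):
    """segments: dict mapping a segment name to its heading in radians."""
    return [name for name, heading in segments.items()
            if is_in_scope(heading, ego_yaw)]


segments = {
    "ahead": math.radians(10),
    "left_turn": math.radians(80),
    "oncoming": math.radians(180),
    "sharp_back": math.radians(-120),
}
kept = filter_segments(segments, ego_yaw=0.0)
```

In practice the topological reachability check would be applied in addition to this angular test; the sketch covers only the angle criterion.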

3.3 Stage 1: Feature Extraction

The most basic information required for driving is the set of potential drivable lanes for the autonomous vehicle. The first stage of our architecture hence outlines a rough sketch of such areas in the input image, following the scope defined in Section 3.2. We adopt an encoder-decoder architecture [chaurasia2017linknet] to aggregate multi-scale features while preserving spatial information at each resolution.

The network outputs three representations of the drivable lanes with the same spatial resolution as the input image. The location of the lanes is encoded as a truncated inverse distance transform image that labels each pixel with its relative distance to the closest reference line. To simplify the lane representation and customize the output for future planning purposes, the reference line is chosen as the center line of a lane segment in the HD map. In contrast to predicting binary outputs at the lane level, the inverse distance transform of the reference line encodes more information about the ideal location of the ego vehicle with respect to the road boundaries. The direction of each lane is represented as an HSV color-encoded image with continuous values, where each pixel in the potential drivable lanes reflects the orientation of the closest reference line. Lastly, the network predicts the perpendicular direction map, which encodes the normal directions to the closest reference line. These features are exploited in the later stages of the network and contribute to the hierarchical definition of the map towards road topology prediction (Figure 2).
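A brute-force sketch of how the three stage 1 target maps could be generated from reference line samples is given below. The normalization `1 - d / truncation`, the use of the truncation radius to delimit where directions are defined, and the function name `stage1_targets` are assumptions for illustration; the HSV color encoding of the direction is omitted and raw angles are stored instead.

```python
import math


def stage1_targets(height, width, ref_points, truncation):
    """Build distance, direction, and perpendicular-direction target maps.

    ref_points: list of (row, col, heading_rad) samples along reference lines.
    truncation: distance (in pixels) beyond which the maps are left empty.
    """
    dist_map = [[0.0] * width for _ in range(height)]
    dir_map = [[None] * width for _ in range(height)]
    perp_map = [[None] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Closest reference line sample and its heading.
            d, heading = min(
                (math.hypot(r - pr, c - pc), h) for pr, pc, h in ref_points
            )
            if d < truncation:
                # Inverse distance: 1 on the reference line, 0 at the cutoff.
                dist_map[r][c] = 1.0 - d / truncation
                dir_map[r][c] = heading % (2 * math.pi)
                perp_map[r][c] = (heading + math.pi / 2) % (2 * math.pi)
    return dist_map, dir_map, perp_map


# A vertical reference line in column 2 of a 5x5 grid (illustrative).
ref_points = [(row, 2, math.pi / 2) for row in range(5)]
dist_map, dir_map, perp_map = stage1_targets(5, 5, ref_points, truncation=2.0)
```

On the reference line the distance target is 1.0, one pixel away it is 0.5, and pixels at or beyond the truncation radius stay at 0 with no direction assigned.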

The parameters of the first-stage model are optimized by minimizing a weighted combination of the reference line detection loss and the two direction estimation losses (lane direction and perpendicular direction). Both the inverse distance transform and the direction map estimation tasks are treated as regression. All three losses are defined as the sum of a cosine similarity term and an L1 term.
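One way to write the weighted combination described above is sketched below. All symbols are assumptions introduced here, since the original notation is not available: the weights are written as \lambda, the subscripts d, o, and p stand for the distance transform, lane direction, and perpendicular direction maps, and the cosine term is assumed to enter as one minus the cosine similarity so that minimizing it aligns prediction and ground truth.

```latex
\mathcal{L}_{\text{stage 1}}
  = \lambda_{d}\,\ell_{d} + \lambda_{o}\,\ell_{o} + \lambda_{p}\,\ell_{p},
\qquad
\ell_{k}
  = \underbrace{\bigl(1 - \cos(\hat{Y}_{k}, Y_{k})\bigr)}_{\text{cosine term}}
  + \underbrace{\lVert \hat{Y}_{k} - Y_{k} \rVert_{1}}_{\text{L1 term}},
\quad k \in \{d, o, p\}
```

Here \hat{Y}_{k} and Y_{k} denote the predicted and ground truth maps for output k.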

3.4 Stage 2: Keypoints Generation

Having an estimate of the ego vehicle's drivable lanes and their corresponding features, the second stage approximates the basis of a graph representation of the road by predicting the graph nodes, referred to as keypoints, on a corresponding keypoint grid. A topology graph is characterized by a set of nodes and their connections.
The graph generated based on the driving horizon defined in Section 3.2 might still be very complex; for example, depending on the grid map resolution, some nodes might be located very close to each other. For a snapshot understanding of road topology, such a level of complexity is not required and the graph can be slightly simplified. Hence, to facilitate the learning process, the resolution of the keypoint grid is chosen with the standard lane width in the United States in mind, so that each grid pixel represents a fixed area of the original perceptive field. To optimize the road graph, reduce the number of parameters, and ensure that keypoints in neighboring lanes do not fall into the same grid cell, only one keypoint is kept per cell, which can be either the first keypoint falling into the cell or the average of all the keypoints within the cell. In addition, all keypoints with a single child that are not a start point are eliminated. An example of this pruning process is depicted in Figure 4(a).
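The two pruning rules above can be sketched as follows. The function names, the cell size of 8 pixels, and the example graph are assumptions for illustration, and the graph is assumed to be given as a mapping from each node to its list of children.

```python
def one_keypoint_per_cell(points, cell_size, average=True):
    """Keep a single keypoint per grid cell: the mean or the first one seen."""
    cells = {}
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(cell, []).append((x, y))
    kept = {}
    for cell, pts in cells.items():
        if average:
            kept[cell] = (sum(p[0] for p in pts) / len(pts),
                          sum(p[1] for p in pts) / len(pts))
        else:
            kept[cell] = pts[0]
    return kept


def drop_pass_through(children, start):
    """Remove non-start nodes with exactly one child, rewiring their parents."""
    children = {node: list(kids) for node, kids in children.items()}
    changed = True
    while changed:
        changed = False
        for node, kids in list(children.items()):
            if node != start and len(kids) == 1:
                child = kids[0]
                # Every parent of `node` now points directly at its child.
                for other in children.values():
                    for idx, k in enumerate(other):
                        if k == node:
                            other[idx] = child
                del children[node]
                changed = True
                break
    return children


kept_avg = one_keypoint_per_cell([(1, 1), (3, 5), (10, 2)], cell_size=8)
kept_first = one_keypoint_per_cell([(1, 1), (3, 5), (10, 2)], 8, average=False)
pruned = drop_pass_through(
    {0: [1], 1: [2], 2: [3, 4], 3: [], 4: [5], 5: []}, start=0)
```

After pruning, only the start point, the fork, and the end points remain, which matches the intent of keeping just the nodes that carry topological information.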

We treat the learning problem as a combination of segmentation and regression in the down-sampled grid space. A lightweight CNN with two 2D convolution layers and 6 residual blocks (Figure 2) is designed to predict the keypoint grid given the outputs of the previous stage. The output keypoint grid encodes the probability that a keypoint exists in each grid cell together with its relative position within the cell, which is subsequently used to refine the actual position of the keypoints in the original resolution. It is noteworthy that there is no distinction between different types of nodes in this process.

To train the network, we minimize the sum of two losses: the pixelwise sigmoid cross entropy, which estimates the likelihood of a cell containing a keypoint, and the mean squared error (MSE) of the keypoint coordinates. The loss is expressed in terms of the predicted total number of keypoints, the ground truth keypoint map at each pixel location, and the corresponding sigmoid output at the same location.
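The two terms described above can be sketched in plain Python as follows. The function name `stage2_loss`, the unweighted sum, and the choice to average the coordinate error over ground truth keypoint cells are assumptions for illustration, not the paper's exact formulation.

```python
import math


def stage2_loss(logits, gt_mask, pred_xy, gt_xy):
    """Sigmoid cross entropy over grid cells plus MSE over keypoint offsets.

    logits, gt_mask: 2D lists (grid cells); gt_mask holds 0 or 1.
    pred_xy, gt_xy: 2D lists of (x, y) in-cell positions.
    """
    bce = 0.0
    for row_logits, row_mask in zip(logits, gt_mask):
        for z, y in zip(row_logits, row_mask):
            # Numerically stable -[y log s(z) + (1 - y) log(1 - s(z))].
            bce += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    cells = [(r, c) for r, row in enumerate(gt_mask)
             for c, y in enumerate(row) if y == 1]
    mse = 0.0
    for r, c in cells:
        (px, py), (gx, gy) = pred_xy[r][c], gt_xy[r][c]
        mse += (px - gx) ** 2 + (py - gy) ** 2
    if cells:
        mse /= len(cells)
    return bce + mse


gt_mask = [[1, 0]]
positions = [[(0.5, 0.5), (0.0, 0.0)]]
uninformed = stage2_loss([[0.0, 0.0]], gt_mask, positions, positions)
confident = stage2_loss([[5.0, -5.0]], gt_mask, positions, positions)
```

With zero logits every cell contributes log 2 to the cross entropy, while confident and correct logits drive the loss towards zero.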

(a) Input graph topology pruning process
(b) HD map points
(c) Interpolated points
Figure 4: Reference line points definition.
Figure 5: Graph affinity matrix prediction.

3.5 Stage 3: Keypoints Connectivity and Reference Line Prediction

To complete the graph with the connecting edges and create the road topological structure, the keypoint grid predicted in the previous stage is passed to the third stage of the network, which estimates the graph affinity matrix holding the connection and lane information of the grid. Hence, in the last stage, in addition to the connections (each of which is inherently a probability estimate of the existence of a reference line between two keypoints), an accurate localization of the reference line is predicted, which is essential to the final task of map-less driving.

3.5.1 Reference Line Definition

For the raw topology generated in Figure 4(a), the definition of a reference line in the HD map might contain a varying number of points depending on the curvature of the underlying line (Figure 4(b)). To simplify the regression task, taking into account the drivable lane orientation constraint explained in Section 3.2, the reference line is represented as a set of anchor points evenly distributed vertically between two given keypoints. This way, the vertical coordinates of the reference line points can be calculated directly and the prediction task is reduced to the regression of the horizontal coordinates only. Figure 4(c) shows this ground truth definition of the reference line.
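The anchor point construction can be sketched as below: anchor rows are placed at a fixed vertical step between two keypoints, and only the horizontal coordinate is read off the HD map polyline. The function names are hypothetical, the 4-pixel default follows the step reported in the experimental setup, and linear interpolation along the polyline is an assumption.

```python
def anchor_rows(y_start, y_end, step=4):
    """Rows of anchor points strictly between two keypoints."""
    if y_start == y_end:
        return []
    direction = -1 if y_end < y_start else 1
    return list(range(y_start + direction * step, y_end, direction * step))


def interpolate_x(polyline, row):
    """Horizontal coordinate of the polyline [(x, y), ...] at a given row."""
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        if y0 != y1 and min(y0, y1) <= row <= max(y0, y1):
            t = (row - y0) / (y1 - y0)
            return x0 + t * (x1 - x0)
    raise ValueError("row lies outside the polyline")


def reference_line_targets(polyline, y_start, y_end, step=4):
    """Ground truth (x, y) anchors; only x needs to be regressed."""
    return [(interpolate_x(polyline, row), row)
            for row in anchor_rows(y_start, y_end, step)]


# HD map polyline between a keypoint at row 40 and one at row 20.
polyline = [(10, 40), (10, 30), (14, 20)]
rows = anchor_rows(40, 20)
targets = reference_line_targets(polyline, 40, 20)
```

Because the rows are fixed by the two keypoints and the step, a network only has to output one number per anchor, regardless of how many points the HD map used for the same segment.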

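| Resolution | Dist. to ref. line MAE | Dist. to ref. line SSIM | Lane direction MAE | Lane direction SSIM | Perp. direction MAE | Perp. direction SSIM | Keypoints Prec. | Keypoints Recall | Keypoints F1 | Connectivity Prec. | Connectivity Recall | Connectivity F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 128×128 | 0.026 | 0.922 | 0.016 | 0.931 | 0.030 | 0.883 | 0.87 | 0.85 | 0.85 | 0.71 | 0.97 | 0.79 |
| 256×256 | 0.015 | 0.956 | 0.010 | 0.966 | 0.026 | 0.878 | 0.80 | 0.71 | 0.75 | 0.57 | 0.83 | 0.66 |

Table 1: Quantitative evaluation of different stages w.r.t. grid map resolution.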

| Complexity (# keypoints) | Keypoints Prec. | Keypoints Recall | Keypoints F1 | Connectivity Prec. | Connectivity Recall | Connectivity F1 | Avg. offset (cm) |
|---|---|---|---|---|---|---|---|
| Easy (1-5) | 0.88 | 0.90 | 0.89 | 0.80 | 0.97 | 0.86 | 15 |
| Medium (6-10) | 0.85 | 0.77 | 0.80 | 0.61 | 0.95 | 0.69 | 26 |
| Difficult (11-15) | 0.72 | 0.62 | 0.68 | 0.40 | 0.78 | 0.53 | 42 |

Table 2: Quantitative evaluation of stages 2 and 3 w.r.t. scene complexity, defined based on the total number of keypoints.

3.5.2 Graph Affinity Matrix Prediction

Theoretically, the keypoint grid can contain as many keypoints as it has cells. In most existing road structures, however, this number empirically stays below a much smaller bound. Hence, rather than using a sparse affinity matrix over all grid cells to represent the graph connections, a dense representation of the affinity matrix is introduced, which keeps the indices of the keypoints and their connectivity information. For a given set of keypoints, represented as a matrix labeled with keypoint indices, the dense representation takes the form of two matrices: one holding the original connected indices and one holding the corresponding reference line information, the latter sized by the maximum number of points used to model a reference line segment. The affinity matrix of predictions is initialized with the fully connected set of predicted keypoints. Accordingly, the stage 3 network predicts a dense affinity matrix that contains connectivity and reference line information. During loss calculation, both the ground truth and the predicted affinity matrices are mapped back to the sparse matrix. This transformation also prevents the accumulation of errors from the previous stage: in case of false predictions, it ensures that all correctly predicted keypoints are indexed to the right position in the sparse matrix before the loss is calculated. Figure 5 depicts this process for both stage 3 ground truth and predictions.
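The mapping between the dense and sparse forms can be sketched as follows. The function names, the flat cell indexing, and the cap of 16 connections (taken from the experimental setup) are assumptions for illustration; reference line coordinates are left out to keep the sketch short.

```python
def dense_to_sparse(keypoint_cells, connections, num_cells):
    """Scatter keypoint-indexed connections into a cell-indexed matrix.

    keypoint_cells[k] is the flat grid-cell index of keypoint k.
    connections is a list of (k_from, k_to) pairs over keypoint indices.
    """
    sparse = [[0] * num_cells for _ in range(num_cells)]
    for a, b in connections:
        sparse[keypoint_cells[a]][keypoint_cells[b]] = 1
    return sparse


def sparse_to_dense(sparse, keypoint_cells, max_connections=16):
    """Gather connections back into keypoint-index pairs."""
    index_of = {cell: k for k, cell in enumerate(keypoint_cells)}
    pairs = [(index_of[r], index_of[c])
             for r, row in enumerate(sparse)
             for c, value in enumerate(row)
             if value and r in index_of and c in index_of]
    return pairs[:max_connections]


# Ground truth: three keypoints in a 4x4 grid (16 cells), chained together.
gt_cells, gt_conn = [5, 9, 14], [(0, 1), (1, 2)]
# Prediction: one spurious keypoint in cell 7 shifts the keypoint indices.
pred_cells, pred_conn = [5, 7, 9, 14], [(0, 2), (2, 3)]

gt_sparse = dense_to_sparse(gt_cells, gt_conn, num_cells=16)
pred_sparse = dense_to_sparse(pred_cells, pred_conn, num_cells=16)
```

Although the spurious keypoint changes the dense indices, both matrices agree once they are expressed per grid cell, so the loss compares like with like.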

To construct the road topology and refine the road structure estimated in the earlier stages, a loss function is defined for reference line classification and localization. It involves three quantities: the likelihood that a connection exists between a pair of keypoints, the coordinates of the reference line anchor points, and the number of connections.

4 Experiments

| Input | Dist. to ref. line MAE | Dist. to ref. line SSIM | Lane direction MAE | Lane direction SSIM | Perp. direction MAE | Perp. direction SSIM | Keypoints Prec. | Keypoints Recall | Keypoints F1 | Connectivity Prec. | Connectivity Recall | Connectivity F1 | Avg. offset (cm) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | 0.027 | 0.922 | 0.017 | 0.935 | 0.031 | 0.885 | 0.87 | 0.82 | 0.83 | 0.70 | 0.97 | 0.78 | 27 |
| (b) | 0.025 | 0.928 | 0.016 | 0.934 | 0.029 | 0.896 | 0.89 | 0.80 | 0.84 | 0.68 | 0.96 | 0.77 | 26 |
| (c) | 0.028 | 0.918 | 0.018 | 0.935 | 0.032 | 0.884 | 0.87 | 0.84 | 0.85 | 0.69 | 0.97 | 0.77 | 26 |
| (d) | 0.026 | 0.919 | 0.016 | 0.922 | 0.034 | 0.879 | 0.88 | 0.81 | 0.84 | 0.68 | 0.95 | 0.76 | 28 |
| (e) | 0.026 | 0.922 | 0.016 | 0.931 | 0.030 | 0.883 | 0.87 | 0.85 | 0.85 | 0.71 | 0.97 | 0.79 | 24 |

Table 3: Quantitative evaluation of different stages w.r.t. input grid map channels. In each experiment the following channels are excluded from the input (Section 3.1): (a) Occupancy, (b) Ground semantics, (c) Ground markings, (d) LiDAR intensity, (e) None.

The experiments are conducted on datasets recorded in Santa Clara County, United States. The data has been collected from multiple passes of several autonomous vehicles equipped with the same set of sensors. The datasets consist of 12,000 frames, with 70 forks/intersections. We leave out one route with 1,000 frames to test generalization. The recordings are split into training and validation sets with a 92:8 ratio.

All ground truth labels (potential ego drivable lanes, directional features, and sparse road topology) are extracted from existing HD maps. Hence, our system has to be trained in areas where HD maps are available, with the goal of generalizing to new, previously unmapped areas. The benefit of this approach is that it does not require additional human annotation.

Towards the goal of facilitating further research and baseline comparison, we will release the dataset publicly.

Experimental Setup

For stage 1, the model was trained using Adam with a learning rate of 1e-4, a decay rate of 0.96, and a decay step of 1e+5. In stage 2, the original input to the model was downsized by a factor of 8. Finally, in stage 3, a value of 16 was chosen for both the maximum number of keypoints and the maximum number of connections, based on the distribution of the training data. For the reference point definition, we chose a 4-pixel step. Since all the stages are differentiable, the network is optimized end-to-end to predict the parameters and trained for 200 epochs over the entire dataset.

We use Mean Absolute Error (MAE) and Structural Similarity Index (SSIM) as evaluation metrics for stage 1 tasks. For stage 2 and 3, precision, recall, and F1-score are chosen for evaluation. The average offset from predicted reference line points to ground truth is used to evaluate the performance of points position prediction.
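For reference, the stage 2 and 3 metrics follow the standard definitions sketched below. How predictions are matched to ground truth (and thus how true and false positives are counted) is not specified in the text, so the counts in the example are purely illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def average_offset(pred_points, gt_points):
    """Mean Euclidean distance between paired predicted and true points."""
    total = sum(((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
                for (px, py), (gx, gy) in zip(pred_points, gt_points))
    return total / len(gt_points)


precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
offset = average_offset([(0, 0), (3, 4)], [(0, 0), (0, 0)])
```

Note that F1 values averaged over frames need not equal the F1 computed from averaged precision and recall, which may explain small differences within a table row.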

Quantitative Analysis

Due to the lack (to the best of our knowledge) of a public road topology benchmark, a comparison with other existing approaches could not be undertaken. Nevertheless, we present our results based on evaluations with our ground truth test data. Table 1 presents the detailed evaluation results of each stage for two grid map resolutions, 128×128 and 256×256. As expected, the results of stages 2 and 3 drop when the resolution is increased, since the sensor data gets sparser at longer range, so the keypoint prediction has more mis-detections and the graph connectivity performance is reduced accordingly. Therefore, for the rest of the experiments we only report the values at 128×128 resolution.

To evaluate the effectiveness of the method towards the ultimate goal of navigating in areas with no HD map, and thereby facilitating self-driving at scale, we evaluate keypoint and reference line point prediction at different scene difficulty levels, which reflect the complexity of the topology in straight roads vs. intersections and forks. As can be seen in Table 2, performance degrades as the number of keypoints increases (connectivity F1 drops from 0.86 on easy scenes to 0.53 on difficult ones, and the average offset grows from 15 cm to 42 cm), yet the system is still able to recover the underlying topology.

Ablation Studies

Table 3 outlines the importance of the different sensor modalities, represented as the input channels explained in Section 3.1. Dropping any single input channel lowers the graph connectivity F1 and increases the average offset relative to the full input. LiDAR intensity plays the most essential role (F1 of 0.76 and 28 cm offset when excluded, vs. 0.79 and 24 cm with all channels). Table 5 showcases the effect of the different stage 1 outputs on the following stages. Excluding the direction map or the perpendicular direction map leads to noticeably lower performance in stage 3, in particular an average offset of 38 cm and 45 cm, respectively, vs. 24 cm with all outputs, indicating the importance of the implicit encoding of direction in the process of learning topology.

Baseline Comparison

To underline the effectiveness of learning connections, we compare our stage 3 network in isolation to a reference method based on the Dijkstra shortest path algorithm. The baseline is described in Algorithm 1. For the results shown in Table 4, ground truth keypoints have been used as input to factor out the influence of stage 2 errors. It can be seen that our learned stage 3 network significantly outperforms the reference method, both in finding correct connections and in reconstructing the correct reference line between two keypoints.

| Method | Connectivity Prec. | Connectivity Recall | Avg. offset (cm) |
|---|---|---|---|
| Proposed method | 0.996 | 0.994 | 17 |
| Shortest path baseline | 0.78 | 0.78 | 78 |

Table 4: Quantitative evaluation of stage 3 compared to the baseline shortest path method.

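Input: distance transform image; keypoint grid
Output: connectivity; reference lines
Data: distance threshold (free parameter); pixel directional steps = (L, TL, T, TR, R)
// Build pixel graph in cost image
foreach pixel p in the distance transform image do
      foreach pixel step s in the directional steps do
           pixel q = p + s
           if q satisfies the distance threshold then
                connect p and q with a weight taken from the cost image
// Find shortest paths in the pixel graph for each keypoint pair
Initialize empty connection graph
for keypoint a in the keypoint grid do
      for keypoint b in the keypoint grid do
           shortest path = Dijkstra(pixel graph, a, b)
           if shortest path exists then
                connect a and b in the connection graph with weight Dist(shortest path)
                store Path(shortest path) as ref. line
connectivity = MinimumSpanningTree(connection graph)
Algorithm 1: Shortest Path Baseline for Stage 3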
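A runnable sketch of the baseline is given below. Several details are assumptions, because the symbols in the algorithm listing are not available: the cost image is taken to be the distance to the nearest reference line, a pixel is traversable if its cost does not exceed the threshold, each step costs one plus the cost of the pixel entered, "top" means a smaller row index, and the minimum spanning tree is built with Kruskal's algorithm.

```python
import heapq

# Assumed (row, col) offsets for the steps L, TL, T, TR, R.
STEPS = [(0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1)]


def dijkstra(cost, start, goal, threshold):
    """Shortest path over pixels whose cost is at most `threshold`."""
    rows, cols = len(cost), len(cost[0])
    best, prev, heap = {start: 0.0}, {}, [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > best[node]:
            continue  # stale heap entry
        if node == goal:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return d, path[::-1]
        for dr, dc in STEPS:
            r, c = node[0] + dr, node[1] + dc
            if 0 <= r < rows and 0 <= c < cols and cost[r][c] <= threshold:
                nd = d + 1.0 + cost[r][c]
                if nd < best.get((r, c), float("inf")):
                    best[(r, c)], prev[(r, c)] = nd, node
                    heapq.heappush(heap, (nd, (r, c)))
    return None


def minimum_spanning_tree(num_nodes, edges):
    """Kruskal's algorithm over (weight, i, j) edges."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, i, j in sorted(edges):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


def shortest_path_baseline(cost, keypoints, threshold):
    edges, ref_lines = [], {}
    for i, a in enumerate(keypoints):
        for j, b in enumerate(keypoints):
            found = dijkstra(cost, a, b, threshold) if i != j else None
            if found is not None:
                edges.append((found[0], i, j))
                ref_lines[(i, j)] = found[1]
    tree = minimum_spanning_tree(len(keypoints), edges)
    return tree, {edge: ref_lines[edge] for edge in tree}


# A Y-shaped fork: cost 0 on the reference lines, 9 elsewhere.
cost = [[0, 9, 9, 9, 0],
        [9, 0, 9, 0, 9],
        [9, 9, 0, 9, 9],
        [9, 9, 0, 9, 9],
        [9, 9, 0, 9, 9]]
keypoints = [(4, 2), (2, 2), (0, 0), (0, 4)]
tree, lines = shortest_path_baseline(cost, keypoints, threshold=1)
```

In this toy example the spanning tree keeps the stem and the two branches and drops the redundant direct links from the start point to the branch ends.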
Figure 6: Qualitative results of road topology predictions on the test data.
Figure 7: Failure cases.

Qualitative Results

Figure 6 depicts results obtained from inference on data that partially comes from areas similar to those in the training set but driven in the opposite direction, with increasing topology complexity. In rows 1 and 2, we showcase how our model correctly infers the change of topology by spawning a new lane boundary at a fork. In rows 3 and 4, we demonstrate the behavior of our model at a left turn, with the option of a U-turn in row 4. Figure 7 displays some failure cases of the system, which stem from the accumulated error of earlier stages or from limited sensor information.

Inference Time

For end-to-end inference, the pipeline is optimized to run at 20 Hz. The inference times for stages 1, 2, and 3 are 7.4, 9.6, and 13.0 ms, respectively, on a single Tesla P100 GPU. These values are comparable to those of the experiments with the 256×256 grid map resolution, which take 9.0, 8.1, and 15.4 ms.
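The reported timings can be checked against the 20 Hz target with a few lines of arithmetic. The variable names are hypothetical, and the sketch assumes the three stages run sequentially on each frame.

```python
# Per-stage inference times in milliseconds, as reported above.
stage_ms_main = [7.4, 9.6, 13.0]
stage_ms_alt = [9.0, 8.1, 15.4]  # the other grid map resolution

# A 20 Hz pipeline leaves 50 ms per frame.
budget_ms = 1000.0 / 20.0


def summarize(stage_ms):
    """Total network time per frame and the remaining budget at 20 Hz."""
    total = sum(stage_ms)
    return total, budget_ms - total


total_main, headroom_main = summarize(stage_ms_main)
total_alt, headroom_alt = summarize(stage_ms_alt)
```

The three stages add up to about 30 ms and 32.5 ms respectively, leaving roughly 20 ms and 17.5 ms per frame for input rasterization and other overhead.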

| Stage 1 output subset | Keypoints Prec. | Keypoints Recall | Keypoints F1 | Connectivity Prec. | Connectivity Recall | Connectivity F1 | Avg. offset (cm) |
|---|---|---|---|---|---|---|---|
| (a) | 0.85 | 0.83 | 0.84 | 0.66 | 0.88 | 0.74 | 69 |
| (b) | 0.86 | 0.84 | 0.85 | 0.69 | 0.95 | 0.77 | 45 |
| (c) | 0.85 | 0.84 | 0.85 | 0.68 | 0.93 | 0.76 | 38 |
| (d) | 0.87 | 0.85 | 0.85 | 0.71 | 0.97 | 0.79 | 24 |

Table 5: Quantitative evaluation of the effectiveness of stage 1 outputs. The output subsets are: (a) Distance to ref. line, (b) Distance to ref. line + Direction of lane, (c) Distance to ref. line + Perp. direction of ref. line, and (d) All.

5 Conclusion

In this work, we have presented an approach that directly estimates structured road topology from an autonomous vehicle's on-board sensors. Our approach copes with various road structures, including one-way streets and large intersections with multiple lanes. Compared to existing approaches, which often require some level of post-processing to improve graph connectivity, fill holes in the prediction through in-painting, need model-based priors, or even have a human in the loop, our approach estimates the road topology in real time and yields a structure that is directly usable by a behavior planning system. It thereby provides an affordable and scalable map solution with fast adaptability and low maintenance. In practice, the predicted map will be utilized by the planning component to define the best trajectory for the autonomous vehicle based on other agents' behavior.
In the future, to boost the generalization, we plan to invest more into procedural generation of vast amounts of topological scenarios in simulation to address corner cases and create more balanced and diverse datasets. Additionally, we will extend the method to include other semantic map elements, such as stop lines and traffic lights.