In recent years, autonomous driving technology has attracted growing interest in both industry and academia. As one of the fundamental building blocks of autonomous driving, lane detection has also received much academic attention. Robust lane detection is a foundation for advanced autonomous driving, and a real-time, accurate lane detection model provides useful information to ADS (Autonomous Driving Systems) for vehicle self-control, localization, and map construction.
Existing lane detection methods can be classified into two types. The more widely studied type treats lane detection as a 2D task in image space, where lane detection algorithms are based on segmentation of the 2D image [cordts2016cityscapes, pan2018spatial, neven2018towards, liu2021condlanenet]. In post-processing, the output of the 2D lane model is usually projected onto the ground plane by IPM (Inverse Perspective Mapping) using the camera’s intrinsic and extrinsic parameters, and curve fitting is then performed to obtain lane lines in the vehicle-ego coordinate system. Under the ideal flat-ground assumption, the IPM-projected output of 2D methods closely matches the true lane lines in the vehicle-ego coordinate system. However, this assumption often fails in actual driving [bai2018deep, neven2018towards], resulting in inaccurate lane lines in the vehicle-ego coordinate system and inaccurate vehicle control. For example, uphill and downhill scenarios are frequent on real roads. In such scenarios, the 2D lane detection + IPM approach generates incorrect 3D lane predictions, which can cause incorrect driving behavior of the autonomous vehicle.
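The size of this error follows from simple ray geometry. The sketch below is a minimal illustration, not part of any cited method: it assumes a pinhole camera at a known height above the ground origin and a road point that actually lies above (uphill) or below (downhill) the plane z = 0. The function name `ipm_ground_point` and the example numbers are hypothetical.

```python
def ipm_ground_point(cam_height, forward, lateral, height):
    """Where flat-ground IPM places a road point that is really at `height`.

    The camera centre is at (0, 0, cam_height) in the ground frame. IPM extends
    the ray from the camera through the point until it meets the plane z = 0,
    so the point is scaled by cam_height / (cam_height - height).
    """
    if height >= cam_height:
        raise ValueError("the point must lie below the camera centre")
    scale = cam_height / (cam_height - height)
    return forward * scale, lateral * scale


# Assumed example: camera 1.5 m high, lane point 50 m ahead on a road that
# has risen by 0.5 m. Flat-ground IPM places it about 75 m ahead.
print(ipm_ground_point(1.5, 50.0, 1.8, 0.5))
# On a truly flat road (height 0) the point is recovered unchanged.
print(ipm_ground_point(1.5, 50.0, 1.8, 0.0))
```

Under these assumptions, both the forward and the lateral coordinates are stretched by the same factor, so an uphill lane appears farther away and wider than it is, and a downhill lane appears closer and narrower.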
To overcome these problems and make lane detection suitable for industrial-grade autonomous driving products, more recent research [garnett20193d, efrat20203d, guo2020gen, li2022reconstruct, chen2022persformer] has started to focus on the more complex 3D lane perception domain. 3D-LaneNet [garnett20193d] proposes an end-to-end approach from 2D image space to orthographic bird’s eye view (BEV) space. The method unifies 2D image feature encoding, spatial transformation, and 3D lane extraction. More recently, PersFormer [chen2022persformer] proposes a framework that unifies the detection of 2D and 3D lane lines. That work also proposes a spatial transformation module, the Perspective Transformer, which yields excellent 3D lane detection results, and publishes a large-scale real-world 3D dataset, OpenLane. These works have shown that it is feasible to regress 3D lanes directly from monocular images using a single model.
2 Related Work
2D Lane Detection.
In recent years, 2D lane detection with deep neural networks has developed impressively, thanks to the effectiveness of CNN models. Work on 2D lane detection follows three directions: pixel-wise segmentation, row-wise segmentation, and curve parameters. [pan2018spatial, zheng2021resa] treat 2D lane detection as a pixel-wise multi-category segmentation task; these methods fix the maximum number of lane lines and are computationally expensive. [li2021hdmapnet, neven2018towards] instead treat 2D lane detection as two-class segmentation and then cluster each lane line with an embedding, which allows a variable number of lanes to be detected. [qin2020ultra, liu2021condlanenet, yoo2020end] detect 2D lane lines at the row-wise level. By setting row anchors in the row direction and grid cells in the column direction to model the 2D lane lines in image space, row-wise methods greatly improve inference speed. [feng2022rethinking, tabelini2021polylanenet] argue that a lane line can be fitted by specific curve parameters in 2D image space, so 2D lane detection can be converted into a regression problem over the starting point, ending point, and curve parameters. However, all these 2D image-based methods must be combined with the camera intrinsic and extrinsic parameters for IPM projection onto the ground in post-processing, which relies on the ideal flat-ground assumption. As mentioned in Section 1, in actual driving it is difficult for 2D lane detection + IPM to obtain the exact position of the real lane line.
BEV Lane Detection. To obtain more accurate road perception results, many works have turned their attention to lane detection in 3D space. [garnett20193d, guo2020gen, chen2022persformer, li2022reconstruct] report state-of-the-art results that demonstrate the feasibility of using CNNs for 3D lane detection from monocular images. [garnett20193d] introduces a unified network for 2D image encoding, spatial transformation, and 3D lane detection with two pathways: the image-view pathway encodes features from the 2D image, while the top-view pathway provides translation-invariant features for 3D lane detection. [guo2020gen] proposes an extensible two-stage framework that separates the image segmentation sub-network from the geometry encoding sub-network. Specifically, the 2D lane segmentation part is trained separately, the 2D segmentation mask is projected by IPM into a virtual top view using the camera intrinsic and extrinsic parameters, and anchor-based 3D lane detection is then performed in the virtual top view. [chen2022persformer] proposes a unified 2D and 3D lane detection framework and introduces a transformer into the spatial transformation module to obtain more robust feature representations; it is the current state of the art in 3D lane detection. It also proposes OpenLane, a large-scale annotated 3D lane dataset based on real scenes.
Spatial Transform. A key module of 3D lane detection is the spatial transformation from 2D features to 3D features. The spatial transform module [jaderberg2015spatial] is a trainable module that can be flexibly inserted into a CNN to apply a spatial transformation to the input feature map, and it is suitable for converting 2D spatial features into 3D geometric features. In 3D lane detection, there are four commonly used kinds of spatial transform modules. IPM-based methods [reiher2020sim2real, garnett20193d, guo2020gen] rely strongly on the camera intrinsic and extrinsic parameters and ignore ground surface undulations and vehicle shaking. MLP-based methods [pan2020cross, li2021hdmapnet] are robust and easy to deploy, but they learn a fixed spatial mapping, which is difficult to integrate with the camera intrinsic and extrinsic parameters. Transformer-based spatial transformation modules [chen2022persformer, li2022bevformer] are more robust, but they are not easy to deploy because of their large amount of computation. Depth-based spatial transformation [philion2020lift, huang2021bevdet] is also computationally heavy and thus not suitable for deployment.
In view of the issues above, we propose our 3D lane detection method, which has three contributions. First, we introduce a data preprocessing module based on a virtual camera, which transforms cameras with different intrinsic and extrinsic parameters into a unified camera through a homography matrix [detone2016deep]. Experiments show that this module helps the network extract more robust features from 2D images. Second, we propose a pyramid module based on MLP spatial transformation. As mentioned above, the MLP spatial transformation is robust and easy to deploy. Inspired by ASPP [chen2017deeplab], we also add the concept of a spatial feature pyramid. Experiments show that this method is very effective. Third, drawing on work in object detection [redmon2016you, tian2019fcos, zhou2019objects], and different from the anchor construction in [ko2021key, qu2021focus, garnett20193d, chen2022persformer], we propose to add bin + offset to 3D lane detection. Experiments on the OpenLane and Apollo datasets show that these methods reduce the y-error and make 3D lane detection more robust.
As shown in Figure 1, the whole network architecture consists of four parts:
Virtual camera: preprocessing of unified camera intrinsic and extrinsic parameters.
2D backbone: 2D feature extractor.
Spatial transform pyramid: transforms 2D features into BEV features.
Yolo style head: detection head based on Yolo [redmon2016you].
As shown in Figure 1, the input images are first transformed to a virtual camera with fixed intrinsic and extrinsic parameters, using the intrinsic and extrinsic parameters of the source camera. This preprocessing module quickly projects images from different cameras to the virtual camera. We then use a 2D feature extractor to extract features from the 2D image. We carried out experiments with ResNet18 and ResNet34 [he2016deep]; the results are given in Table 4. To help the network extract 2D features, 2D lane auxiliary supervision is added to the output of the backbone network. Inspired by [chen2017deeplab], we then design a fast multi-scale spatial transformation module based on [pan2020cross], which we call the spatial transform pyramid. This module is responsible for the spatial transform from 2D to BEV. Finally, we predict the lanes on the BEV plane. We divide the ground coordinate system into cells, each of which represents a square area whose side length defaults to 0.5 m. Inspired by Yolo [redmon2016you], we regress, for each cell in the BEV plane, the confidence, the embedding used for clustering, the offset from the cell center to the lane in the row direction, and the average height. At inference, we use a fast clustering method to fuse the results of all branches into lanes.
3.1 Preprocessing of unified camera intrinsic and extrinsic parameters
The camera intrinsic and extrinsic parameters differ from vehicle to vehicle, which has a great impact on the results of the model. Unlike methods that integrate the camera intrinsic and extrinsic parameters into the network [chen2022persformer, philion2020lift], we quickly unify the camera parameters in a preprocessing step by establishing a virtual camera with fixed intrinsic and extrinsic parameters.
The 3D lane detection task mainly concerns the plane with height 0 of the ground coordinate system. We therefore exploit the coplanarity property of homography and project the image of the current camera to the virtual camera through a homography matrix. In this way, the virtual camera with fixed intrinsic and extrinsic parameters unifies the intrinsic and extrinsic parameters of all cameras.
As shown in Figure 2, the intrinsic and extrinsic parameters of the virtual camera are fixed and are derived from the mean of the intrinsic and extrinsic parameters over the images of the dataset. In both the training and inference stages, the homography matrix is calculated from the intrinsic and extrinsic parameters of the current camera and those of the virtual camera. Following [homographies], we first select four points on the BEV plane and project them into the current camera and the virtual camera to obtain two sets of corresponding pixel coordinates. The homography matrix that maps the first set onto the second is then obtained by the least-squares method, as shown in Eqn. 1. The effect of the virtual camera is shown in Table 4.
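The least-squares step can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the authors' implementation: the two lists of pixel coordinates are taken as given (projecting the BEV points through each camera is omitted), the bottom-right entry of the homography is fixed to 1, and the helper names `solve_linear`, `fit_homography`, and `apply_homography` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_homography(src, dst):
    """Least-squares 3x3 homography (last entry fixed to 1) mapping src to dst.

    src, dst: lists of at least four (u, v) pixel pairs, e.g. the projections
    of the same BEV points into the current and the virtual camera.
    """
    rows, rhs = [], []
    for (u, v), (x, y) in zip(src, dst):
        rows.append([u, v, 1.0, 0.0, 0.0, 0.0, -u * x, -v * x])
        rhs.append(x)
        rows.append([0.0, 0.0, 0.0, u, v, 1.0, -u * y, -v * y])
        rhs.append(y)
    # Normal equations (A^T A) h = A^T b give the least-squares solution.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(8)]
    h = solve_linear(ata, atb) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def apply_homography(h, point):
    """Map one (u, v) pixel through the homography h."""
    u, v = point
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return ((h[0][0] * u + h[0][1] * v + h[0][2]) / w,
            (h[1][0] * u + h[1][1] * v + h[1][2]) / w)
```

With exactly four points in general position the system has a unique solution, so the least-squares fit reproduces the correspondences exactly. With raw pixel coordinates the normal equations can become poorly conditioned, so normalizing the coordinates first is advisable in such a sketch.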
3.2 MLP Based Spatial Transformation Pyramid
Because transformer-based, depth-based, and similar methods are computationally expensive and poorly suited to autonomous driving chips, our spatial transformation module is based on MLP [pan2020cross]. In addition, the virtual camera overcomes the problem that MLP cannot integrate the camera intrinsic and extrinsic parameters. Our experiments also show that 2D features taken from different layers affect the MLP differently, as Table 5 illustrates. In particular, low-level feature layers require a large amount of computation and perform poorly. We conjecture that this is because MLP is a fixed spatial mapping and is not suitable for high-resolution features. Inspired by the feature pyramid [chen2017deeplab], we design a spatial transformation pyramid based on MLP. We apply the spatial transform to the 1/64 resolution feature s64 and the 1/32 resolution feature s32 separately, and then concatenate the two results.
Here s32 and s64 each have their own MLP, and the concatenation is performed for every pixel of the BEV feature.
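The data flow of the pyramid can be sketched in a few lines. This is an illustrative sketch under assumptions, not the actual module: a single linear layer stands in for each MLP, features are plain nested lists with flattened spatial dimensions, the weights are shared across channels, and the names `view_transform` and `spatial_transform_pyramid` are hypothetical.

```python
def view_transform(feature, weight, bias):
    """Map each channel's flattened image-view pixels to flattened BEV pixels.

    feature: list of C channels, each a flat list of n_in values.
    weight:  n_out rows of n_in values (one row per BEV pixel).
    bias:    n_out values.
    Returns C channels, each a flat list of n_out BEV values.
    """
    out = []
    for channel in feature:
        out.append([
            sum(w * x for w, x in zip(row, channel)) + b
            for row, b in zip(weight, bias)
        ])
    return out


def spatial_transform_pyramid(feat_s32, feat_s64, params_s32, params_s64):
    """Transform s32 and s64 separately, then concatenate along channels."""
    bev_s32 = view_transform(feat_s32, *params_s32)
    bev_s64 = view_transform(feat_s64, *params_s64)
    return bev_s32 + bev_s64


# Toy sizes (assumed): 2 channels of 4 pixels at s32, 3 channels of 1 pixel
# at s64, and a BEV map of 6 pixels.
feat_s32 = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
feat_s64 = [[1.0], [2.0], [3.0]]
params_s32 = ([[1.0] * 4 for _ in range(6)], [0.0] * 6)
params_s64 = ([[2.0] for _ in range(6)], [0.5] * 6)
bev = spatial_transform_pyramid(feat_s32, feat_s64, params_s32, params_s64)
print(len(bev), len(bev[0]))  # 5 channels, 6 BEV pixels each
```

Because every BEV pixel is a learned weighted sum over all image-view pixels, the cost of each mapping grows with the product of the input and output resolutions, which is consistent with restricting the transform to the two coarsest feature levels.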
3.3 Yolo Style Head
Different from 3D anchors and similar designs, we predict 3D lanes in the BEV plane in a way inspired by Yolo [redmon2016you]. As shown in the red box in Figure 1, we divide the plane with height 0 of the ground coordinate system into cells, each of which represents a square area whose side length defaults to 0.5 m. For each cell we regress the confidence that the cell contains a lane, the lane embedding used for clustering, the offset from the cell center to the lane in the row direction, and the average height. Unlike Yolo, we use a separate decoupled head for each branch.
Cell size has a great influence on the prediction results. If the cell size is too small, the positive and negative samples become imbalanced. If the cell size is too large, different lanes overlap within one cell. The effect of cell size on lane prediction is shown in Table 5. Considering the sparsity of lanes, our experiments suggest a cell size of 0.5 m × 0.5 m. We predict lanes from -10 m to 10 m in the lateral direction and from 3 m to 103 m in the longitudinal direction of the ground coordinate system, so the height H and width W of the BEV prediction map are 200 and 40, respectively. The BEV lane detection head therefore outputs four 200 × 40 maps: confidence, embedding, offset, and height. The confidence, embedding, and offset heads together generate the lane instances in BEV, as shown in Figure 3.
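The grid dimensions follow directly from the prediction range and the cell size. The short sketch below only reproduces this arithmetic; the function name `bev_grid_shape` is hypothetical, and the ranges are the ones stated in the text.

```python
def bev_grid_shape(lateral_range, forward_range, cell_size):
    """Return (H, W): the number of cells along the forward and lateral axes."""
    width = round((lateral_range[1] - lateral_range[0]) / cell_size)
    height = round((forward_range[1] - forward_range[0]) / cell_size)
    return height, width


# Cell sizes compared in the ablation on output resolution.
for size in (0.05, 0.2, 0.5, 1.0):
    h, w = bev_grid_shape((-10.0, 10.0), (3.0, 103.0), size)
    print(f"{size} m cells: H={h}, W={w}, total cells={h * w}")
```

Halving the cell size quadruples the number of cells while the number of cells crossed by a thin lane only roughly doubles, which is one way to see why small cells worsen the imbalance between positive and negative samples.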
Similar to Yolo [redmon2016you], the confidence of lanes is a binary branch. Each pixel represents the confidence that the corresponding cell contains a lane. We interpolate the lanes in the visible area of the training data; the confidence of a pixel is 1 if a lane passes through its cell and 0 otherwise. The confidence loss is the binary cross-entropy loss.
Here the loss compares the confidence predicted by the model with the ground-truth label of each cell.
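As a reference point, the standard binary cross-entropy over the cells of a confidence map can be written as below. This is a generic sketch, assuming predictions are already probabilities in (0, 1) and that the loss is averaged over all cells; the name `confidence_loss` and the clamping constant `eps` are illustrative choices, not taken from the paper.

```python
import math


def confidence_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted and ground-truth confidence.

    pred:   flat list of predicted probabilities, one per BEV cell.
    target: flat list of 0/1 labels (1 if a lane passes through the cell).
    """
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


# An uninformative prediction of 0.5 everywhere costs log(2) per cell.
print(confidence_loss([0.5, 0.5], [1, 0]))
```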
Similar to Yolo [redmon2016you], the offset branch predicts the offset from the cell center to the lane in the row direction, while the confidence branch predicts the probability that the cell contains a lane. The larger the cell size, the larger the possible offset. As shown in Figure 3, and similar to Yolo, a normalized offset is used to represent the distance between the grid cell center and the lane key point in the row direction.
Here an indicator marks whether a lane passes through the cell, and the loss compares the predicted lateral offset with the ground-truth lateral offset of each such cell.
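One way to realize such a normalized offset is sketched below. The normalization convention is an assumption: the offset is measured from the cell center and divided by the cell size, so it lies in [-0.5, 0.5). The names `encode_offset` and `decode_offset` and the lateral range starting at -10 m are illustrative.

```python
import math


def encode_offset(lateral, lateral_min, cell_size):
    """Return (column index, normalized offset from the cell center)."""
    col = math.floor((lateral - lateral_min) / cell_size)
    center = lateral_min + (col + 0.5) * cell_size
    return col, (lateral - center) / cell_size


def decode_offset(col, offset, lateral_min, cell_size):
    """Recover the lateral position from a column index and its offset."""
    center = lateral_min + (col + 0.5) * cell_size
    return center + offset * cell_size


# A lane key point 0.3 m to one side of the origin, with 0.5 m cells.
col, offset = encode_offset(0.3, -10.0, 0.5)
print(col, offset)
print(decode_offset(col, offset, -10.0, 0.5))
```

Without the offset, the decoded lateral position is always the cell center, so the worst-case quantization error is half a cell; this grows with the cell size, which is the effect the offset branch is meant to remove.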
To distinguish the lane identity of each pixel in the confidence branch, we predict an embedding feature for each pixel, following [de2017semantic]. During training, metric learning pulls together the embeddings of pixels with the same lane ID and pushes apart the embeddings of pixels with different lane IDs. At inference, we use a fast unsupervised clustering post-processing method to predict a variable number of lanes. Unlike 2D lanes, which usually converge at the far end of the image, 3D lanes stay roughly parallel in BEV, which makes them better suited to embedding-based clustering.
Here one term is computed from the mean embedding of each lane, and the other from the variance between different lanes.
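A pull/push loss of this kind can be sketched as follows, in the spirit of the discriminative loss of [de2017semantic]. It is a simplified illustration rather than the loss used here: the margin values `delta_pull` and `delta_push`, the averaging scheme, and the function name `embedding_loss` are assumptions, and no regularization term is included.

```python
import math
from itertools import combinations


def embedding_loss(embeddings, lane_ids, delta_pull=0.5, delta_push=1.5):
    """Return (pull, push) terms over the embeddings of positive cells.

    embeddings: list of equal-length vectors, one per positive cell.
    lane_ids:   lane ID of each cell, in the same order.
    """
    groups = {}
    for vec, lane in zip(embeddings, lane_ids):
        groups.setdefault(lane, []).append(vec)
    if not groups:
        return 0.0, 0.0
    # Mean embedding of each lane.
    means = {lane: [sum(col) / len(vecs) for col in zip(*vecs)]
             for lane, vecs in groups.items()}

    # Pull: penalize cells farther than delta_pull from their lane mean.
    pull = 0.0
    for lane, vecs in groups.items():
        hinge = [max(0.0, math.dist(v, means[lane]) - delta_pull) ** 2
                 for v in vecs]
        pull += sum(hinge) / len(vecs)
    pull /= len(groups)

    # Push: penalize lane means closer than 2 * delta_push to each other.
    pairs = list(combinations(means.values(), 2))
    push = 0.0
    if pairs:
        push = sum(max(0.0, 2 * delta_push - math.dist(a, b)) ** 2
                   for a, b in pairs) / len(pairs)
    return pull, push


# Two lanes whose one-dimensional embeddings are only 1 apart: no pull cost,
# but a push cost of (2 * 1.5 - 1) ** 2 = 4.
print(embedding_loss([[0.0], [1.0]], [1, 2]))
```

Once the lane means are separated by more than the push margin and each lane is tight around its mean, a simple distance threshold is enough to group cells into lanes at inference.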
Because our task is to predict 3D lanes, and the confidence and offset branches only predict the planar position of the key points of a 3D lane, the height branch is responsible for predicting the height coordinate of these key points. In the training phase, we use the average lane height within a grid cell as the ground-truth height of that cell. As in the offset branch, only positive samples are counted in the loss.
Here the loss compares the predicted value with the ground-truth value of each positive cell.
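The construction of the height target and the restriction of the loss to positive cells can be sketched as below. The regression loss itself is an assumption (mean squared error is used as a stand-in), as are the row and column conventions (rows grow with forward distance, columns with lateral position) and the names `cell_height_targets` and `masked_mse`.

```python
def cell_height_targets(points, lateral_min, forward_min, cell_size):
    """Average the heights of lane points that fall into the same BEV cell.

    points: iterable of (forward, lateral, height) in the ground frame.
    Returns {(row, col): mean height} for every cell that contains a point.
    """
    sums = {}
    for forward, lateral, height in points:
        cell = (int((forward - forward_min) // cell_size),
                int((lateral - lateral_min) // cell_size))
        total, count = sums.get(cell, (0.0, 0))
        sums[cell] = (total + height, count + 1)
    return {cell: total / count for cell, (total, count) in sums.items()}


def masked_mse(pred, target, mask):
    """Mean squared error over the cells whose mask is 1 (positive cells)."""
    picked = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(picked) / len(picked) if picked else 0.0


# Two interpolated lane points share a cell; a third lies in another cell.
points = [(3.1, 0.1, 0.2), (3.4, 0.2, 0.4), (10.0, -5.0, 1.0)]
print(cell_height_targets(points, -10.0, 3.0, 0.5))
# Only the first cell is positive, so the second prediction is ignored.
print(masked_mse([1.0, 5.0], [0.0, 0.0], [1, 0]))
```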
3.3.5 Total loss
The total loss includes 3d lane loss and 2d lane loss. The 2d lane loss includes 2d lane segmentation loss and 2d lane clustering loss.
Here the two 2D terms are the 2D lane segmentation loss and the 2D lane embedding loss.
To verify the performance of our work, our model is tested on the real-world OpenLane dataset and the simulated Apollo dataset. Compared with previous methods, including PersFormer [chen2022persformer], Reconstruct from Top View [li2022reconstruct], Gen-LaneNet [guo2020gen], 3D-LaneNet [efrat20203d], and CLGO [liu2022learning], our method reaches the state-of-the-art level in both the error metrics and F1-score.
4.1 Evaluation Metrics and Implementation Details
Our evaluation metrics follow Gen-LaneNet [guo2020gen] on both 3D datasets. They include the F1-score in different scenes, the x error in the near and far ranges, and the z (height) error in the near and far ranges.
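For reference, the F1-score combines precision and recall over matched lanes. The sketch below shows only this final step; how a predicted lane is matched to a ground-truth lane follows the Gen-LaneNet protocol and is not reproduced here. The function name `f1_score` and the example counts are hypothetical.

```python
def f1_score(num_matched, num_predicted, num_ground_truth):
    """Harmonic mean of precision and recall from lane-matching counts."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumed counts: 8 matched lanes out of 10 predictions and 16 ground-truth
# lanes give precision 0.8 and recall 0.5.
print(f1_score(8, 10, 16))
```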
Comparison with other open-sourced 3D methods on OpenLane. Our method achieves the best F1-Score on the entire validation set and every scenario set
4.2 Results on OpenLane
OpenLane contains 150000 training samples and 40000 test samples. To verify the performance of the model in every scene, the Up&Down case, Curve case, Extreme Weather case, Night case, Intersection case, and Merge&Split case are split from the validation set. Table 1 shows the F1-score of the model in every scene. Our model is trained for 10 epochs on the training set and achieves state-of-the-art results in every scene. Table 2 shows the F1-score, x error, and z error of the different methods.
|Method||F-Score||X error near||X error far||Z error near||Z error far|
4.3 Results on Apollo 3D Synthetic
The Apollo 3D Synthetic dataset [guo2020gen] includes 10500 discrete frames of monocular RGB images with corresponding 3D lane ground truth, split into three scenes: balanced scenes, rarely observed scenes, and scenes with visual variations. Each scene has its own training and test sets. Note that our virtual camera requires the camera extrinsic parameters, which Apollo does not provide directly. We therefore compute the intrinsic and extrinsic parameters of the camera from the camera height and pitch provided by the dataset.
In Table 3 below, we compare our work with previous work. Our model was trained for 80 epochs on these datasets. Its F1-score and x error reach the state of the art, and its z error is also competitive.
|Reconstruct from top||91.9||0.049||0.387||0.008||0.213|
|Reconstruct from top||83.7||0.126||0.903||0.023||0.625|
|Reconstruct from top||89.9||0.06||0.446||0.011||0.235|
4.4 Ablation Study
The experiments in this section are carried out on OpenLane, and the evaluation metrics again follow Gen-LaneNet [guo2020gen]. The baseline uses a fixed cell size, the MLP of [pan2020cross] for spatial transformation, and no virtual camera. We demonstrate the effectiveness of our methods by adding three components in turn: Virtual Camera (VC), Spatial Transform Pyramid (STP), and Yolo Style Head (YSH), as shown in Table 4. In addition, we compare the effects of different backbones and of 2D auxiliary supervision on the results. Table 5 shows the influence of different output resolutions on the Yolo Style Head.
|Method||F-score||X error near||X error far||Z error near||Z error far|
|+ virtual camera||53.3||0.324||0.695||0.330||0.705|
|+ 2d det||54.5||0.338||0.689||0.258||0.614|
|+ YSH without offset||57.9||0.429||0.734||0.243||0.630|
|Method||F-score||X error near||X error far||Z error near||Z error far|
|0.05m without offset||43.2||0.345||0.770||0.261||0.700|
|0.2m without offset||55.7||0.321||0.701||0.253||0.624|
|0.5m without offset||57.9||0.429||0.734||0.243||0.630|
|0.5m with offset||58.4||0.309||0.659||0.244||0.631|
|1m without offset||56.8||0.607||0.856||0.241||0.593|
|1m with offset||57.7||0.317||0.671||0.245||0.590|