In recent years, autonomous driving has received much attention in computer vision and robotics research, at both academic and industrial levels. A critical step in autonomous driving is the vehicle's recognition of its operating environment, of which road lane markings form an integral component. In particular, active lane markings serve as significant cues for constraining the maneuvers of vehicles on roads. They indicate the active lane, which is the only road space usable by the vehicle, and serve as input for lateral steering control to avoid collisions with other road users. Despite the pressing need for accurate and reliable lane detection to enable successful autonomous vehicles, lane detection has remained challenging over the years. One reason is the simple and homogeneous appearance of lane markings, which lack distinctive features. Other obstacles, such as weather and illumination conditions, also plague lane detection research. Furthermore, lane detection must operate in real time across diverse driving environments and road surface conditions, which necessitates a robust, low computational cost algorithm for successful lane detection on autonomous vehicles.
To address the lane detection problem, deep learning models have gained popularity in recent lane detection literature [Garnett2019, Ghafoorian2018, Gansbeke2019, Hou2020]. Contemporary lane detection algorithms based on end-to-end deep learning models have shown great promise [Pan2018, Hou2019, Garnett2019], achieving competitive results against traditional lane detection methods while being more robust to a greater range of driving conditions. However, many of these models still perform poorly on datasets that differ significantly from their train sets, e.g. in road surface conditions and types of lane markings. In these situations, false positives (also known as noise), undetected lanes and broken lane edges are common, leading to accuracy degradation and instability in the control of autonomous vehicles.
In this paper, we aim to address this issue by proposing a robust neural network output enhancement for active lane detection (RONELD) method, a robust, low computational cost and real-time solution for use together with deep learning models on autonomous vehicles. It is motivated by the poor performance of existing deep learning models on new, unseen datasets, which makes them problematic for autonomous vehicles that rely heavily on accurate lane detection. Our method is built on the observation that accuracy can be improved by enhancing the lane markings predicted from the probability map outputs of existing deep learning models, particularly on datasets that differ greatly from the model's train set. RONELD is intended as a turnkey solution that leverages probability map outputs from deep learning models to optimize active lane detection, producing more stable and robust active lanes that are better suited for autonomous driving applications. In addition, its low computational time makes it suitable for real-time use on autonomous vehicles. To verify the usefulness of RONELD, we test it on two state-of-the-art deep learning models, Spatial CNN (SCNN) [Pan2018] and ENet-SAD [Hou2019], and record the resulting accuracy and processing time. Our experiments demonstrate the fast runtime of RONELD and its effectiveness in increasing the accuracy of these two popular models. In Fig. 1, we show two simple before-and-after results of applying RONELD to the SCNN probability map output.
The rest of the paper is organized as follows. Section II discusses related work. In Section III, we explain our methodology in four parts: adaptive lane point extraction, curved lane detection, lane construction and tracking of preceding frames. Section IV compares experimental results with benchmarks, and Section V concludes our work.
II Related work
Traditional lane detection. Traditional lane detection methods [Wu-2014, Tan2014, Kaur2015, Deusch2012] rely on hand-crafted features such as color-based features [Chiu2005], bar filters [Teng2010], ridge features [Lopez2010], the Hough transform [Liu2010, Zhou2010] and random sample consensus (RANSAC) [Borkar2009, Aly2008] to identify lane segments. Tracking techniques such as particle or Kalman filters [Teng2010, Danescu2009, Borkar2009] are used as a final stage for lane tracking to map the lanes onto the current frame. Loose et al. combined the Kalman and particle filters into a Kalman Particle Filter [Loose2009] for lane detection on non-marked rural roads. In general, most of these traditional methods lack robustness and can only solve the lane detection problem in limited scenarios or under strict lane assumptions, e.g. that lanes are straight [Li-2015, Niu2016] and parallel [Jiang2010, Nieto2008]. These assumptions do not always hold, particularly in complicated urban driving environments or in scenes with poor weather and road conditions, where varying road surface conditions (e.g. faded lane markings, discolored road surfaces), different lane marking colors and poor visibility significantly reduce the accuracy of traditional lane detection methods.
Deep learning based lane detection. After demonstrating compelling results in many other computer vision problems [Zou2020], deep learning methods have been introduced to replace traditional hand-crafted feature-based lane detection algorithms [Lee2017, Garnett2019, Ghafoorian2018, Gurghian2016]. One common approach is to treat lane detection as a semantic segmentation task and use end-to-end deep learning models to make dense predictions, i.e. predict a label for each pixel in the image indicating whether it is part of a lane marking [Pan2018, Hou2019, He-2016]. Other methods take an instance segmentation approach, i.e. treat each lane as its own instance [Chang2019, Nevan2018]. He et al. introduced a Dual-View CNN (DVCNN) [He-2016] method which uses front and top view images simultaneously to eliminate false positives and remove non-club-shaped structures respectively. Lee et al. proposed a vanishing point guided network (VPGNet) [Lee2017] to address lane detection under low illumination conditions by detecting lane and road markings as well as the vanishing point in a multi-task network. Later, Pan et al. proposed SCNN [Pan2018], which generalizes deep layer-by-layer convolutions to slice-by-slice convolutions within feature maps, enabling message passing between pixels across rows and columns in a layer. It is designed for long, continuous structures or large objects with strong spatial relationships but few appearance clues (e.g. traffic lanes). Recently, a self attention distillation (SAD) method [Hou2019] was proposed and incorporated into the lightweight ENet [Paszke2016], ResNet-18 [resnet] and ResNet-34 [resnet] models. In particular, the SAD-incorporated ENet model, ENet-SAD, runs 10 times faster than SCNN while achieving comparable performance on popular benchmarks such as CULane [Pan2018] and TuSimple [TuSimple-2019].
Although the aforementioned deep learning methods provide promising lane detection results on the datasets they were trained on, their inflexibility presents challenges when road conditions deviate from their train sets. This is also a concern when lane markings are obscured or degraded by fading, stains, shadows or occlusions. These issues make it difficult to apply these models on autonomous vehicles, which might encounter new, unseen environments. In Fig. 2, we include probability map outputs from the CULane-trained SCNN and ENet-SAD models on unseen TuSimple test set images to illustrate this issue. The alternative is to train the deep learning models on extensive train sets that account for the myriad possible environments autonomous vehicles might encounter, which is expensive and time-consuming.
To tackle this problem, several methods have been proposed to enhance lane detection outputs from deep learning models [Ko2020, Kim2014, Lee2017], usually by exploiting geometric properties (e.g. vanishing points). However, these methods are usually designed for use with specific deep learning models or lack robustness, which makes them unsuitable for autonomous driving applications with other existing models.
In this section, we discuss our RONELD method with reference to the process workflow diagram presented in Fig. 3.
III-A Adaptive Lane Point Extraction
We first extract lane points from the probability map output generated by a deep learning model (e.g. SCNN). To be robust to low-confidence noise, while still accurately detecting lane points in the context of the current frame's probability map output, we search for salient points, i.e. points whose confidence exceeds an adaptive confidence threshold. This threshold is adapted based on the highest confidence point in the probability map outputs.
After a salient point is found on a detected lane marking, we search the subsequent rows and neighbouring columns around it, with the size of this search window defined relative to $h$, the height of the probability map output, and $w$, its width. Restricting the search to neighbouring points excludes extraneous objects and noise in other parts of the output, and reduces processing time by focusing the search area around the detected salient point. We take the highest confidence point on the lane within the search area as a lane point for the detected lane marking. This is repeated throughout the probability map outputs to identify the lane points of each lane marking.
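The extraction step can be sketched in plain Python as follows. This is a minimal illustration, not the RONELD implementation: the threshold ratio (`ratio`), the window size (`rows_ahead`, `col_radius`), the bottom-up scan order and the function name `extract_lane_points` are all hypothetical choices, and the map is assumed to hold confidences for a single lane marking.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int]  # (row, col) in the probability map


def extract_lane_points(prob: List[List[float]],
                        ratio: float = 0.5,
                        rows_ahead: int = 2,
                        col_radius: int = 1) -> List[Point]:
    """Trace one lane marking through a single-lane probability map.

    ratio, rows_ahead and col_radius are hypothetical illustrative values;
    the paper sizes its window relative to the map height and width.
    """
    h, w = len(prob), len(prob[0])
    # Adaptive threshold: a fraction of the most confident pixel in this map.
    threshold = ratio * max(max(row) for row in prob)

    # Assumption: scan from the bottom row upwards for the first salient point.
    start: Optional[Point] = None
    for r in range(h - 1, -1, -1):
        c = max(range(w), key=lambda j: prob[r][j])
        if prob[r][c] > threshold:
            start = (r, c)
            break
    if start is None:
        return []

    points = [start]
    r, c = start
    while True:
        # Search only the next few rows and the neighbouring columns.
        best: Optional[Point] = None
        best_val = threshold
        for rr in range(r - 1, max(r - 1 - rows_ahead, -1), -1):
            for cc in range(max(c - col_radius, 0), min(c + col_radius + 1, w)):
                if prob[rr][cc] > best_val:
                    best, best_val = (rr, cc), prob[rr][cc]
        if best is None:
            break  # no salient neighbour: the lane marking ends here
        points.append(best)
        r, c = best
    return points
```

Because each step only looks inside a small window around the last lane point, isolated high-confidence pixels elsewhere in the map are never visited.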
III-B Curved Lane Detection
We separate detected lane markings into two broad categories: straight lanes and curved lanes. This lets us accommodate lanes that do not follow a linear model, while still using a linear model to fix broken (undetected) lane edges in the probability map output for straight lanes. For straight lanes, we require a minimum number of lane points to reduce the impact of noise; for curved lanes, due to their greater complexity, we require a larger minimum. To differentiate between the two categories, we use the coefficient of determination, $R^2$, to assess how well the points fit a linear model. The coefficient of determination measures the proportion of the variance in one variable that can be explained by a linear regression model and predictor variable(s). For a simple linear regression, it is calculated as follows:

$$R^2 = \frac{\operatorname{Cov}(X, Y)^2}{\operatorname{Var}(X)\,\operatorname{Var}(Y)},$$

where $X$ and $Y$ are the random variables denoting the $x$- and $y$-components of the detected lane points, $\operatorname{Cov}(X, Y)$ is the covariance of $X$ and $Y$, and $\operatorname{Var}(X)$ and $\operatorname{Var}(Y)$ are the variances of $X$ and $Y$. We compare $R^2$ between the whole lane marking and the lane marking with its topmost points removed. If the whole lane marking has a lower $R^2$ than the truncated section, it fits a linear model worse than the truncated section does. This in turn implies that the top lane points stray further from the regression line, which is characteristic of a curve; hence, we mark this lane as a curved lane. Otherwise, we mark it as a straight lane. To reduce false curve predictions in our detected lane markings, we corroborate curves in the current frame with curves in previous frames.
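A minimal sketch of the straight/curved test, assuming points are given as `(x, y)` image coordinates with `y` increasing downwards. The number of dropped top points `n_top`, the minimum point count, the tolerance `eps` and the helper names are hypothetical; the tolerance only guards against floating-point ties on perfectly straight lanes.

```python
from typing import List, Tuple


def r_squared(xs: List[float], ys: List[float]) -> float:
    """Coefficient of determination of a simple linear fit: Cov^2 / (VarX * VarY)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        return 1.0  # degenerate case: points lie on a vertical/horizontal line
    return cov * cov / (var_x * var_y)


def is_curved(points: List[Tuple[float, float]], n_top: int = 3,
              eps: float = 1e-9) -> bool:
    """Flag a lane as curved if dropping its top points improves R^2.

    The "top" points are those with the smallest y (furthest from the car).
    """
    if len(points) < n_top + 3:
        return False  # too few points left to compare the two fits
    pts = sorted(points, key=lambda p: p[1])
    whole = r_squared([p[0] for p in pts], [p[1] for p in pts])
    rest = pts[n_top:]
    truncated = r_squared([p[0] for p in rest], [p[1] for p in rest])
    return whole < truncated - eps
```

A lane whose lower section is straight but whose far end bends away gets a noticeably lower $R^2$ as a whole than without its top points, so it is flagged as curved, whereas a straight lane scores the same either way.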
III-C Lane Construction
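A predicted curved lane has its points connected by quadratic splines to form the final lane marking output. For straight lanes, we attempt to fix broken lane edges in the probability map output and remove outliers by fitting the detected lane points to a linear model of the form

$$\mathbf{x} = \mathbf{A}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$

where

$$\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad \mathbf{A} = \begin{bmatrix} 1 & y_1 \\ \vdots & \vdots \\ 1 & y_n \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix},$$

$(x_i, y_i)$ are the $x$- and $y$-coordinates of the $i$-th detected lane point, $\beta_0$ and $\beta_1$ are the $x$-intercept and gradient of the line respectively, $\varepsilon_i$ is the error term of the $i$-th lane point, and $n$ is the number of detected lane points.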
We obtain a weighted least squares estimate of $\boldsymbol{\beta}$, $\hat{\boldsymbol{\beta}}$, from the sample of detected lane points using weighted least squares linear regression, with the weight of each detected lane point set to its confidence in the probability map output of the deep learning model. We use weights to mitigate heteroskedasticity: the variance of the $x$-coordinates is not constant across the range of $y$-coordinates of the detected lane points. However, the variance of each detected lane point is unknown and depends on factors such as the accuracy of the deep learning model on that dataset. To address this, we assume that higher confidence points are detected more accurately and hence have smaller errors with lower variances, and we therefore allocate higher weights to them by virtue of their higher confidence in the output probability maps. We seek the solution that minimizes the weighted squared error of our linear model, $\boldsymbol{\varepsilon}_w^\top \boldsymbol{\varepsilon}_w$, where $\boldsymbol{\varepsilon}_w$ is the vector containing the weighted error term for each lane point observation, calculated as follows:

$$\boldsymbol{\varepsilon}_w = \mathbf{W}^{1/2}\left(\mathbf{x} - \mathbf{A}\boldsymbol{\beta}\right), \qquad \mathbf{W} = \operatorname{diag}(w_1, \dots, w_n),$$

and $w_i$ is the confidence of the $i$-th detected lane point. Minimizing $\boldsymbol{\varepsilon}_w^\top \boldsymbol{\varepsilon}_w = \sum_{i=1}^{n} w_i \left(x_i - \beta_0 - \beta_1 y_i\right)^2$ over our sample of detected lane points gives

$$\hat{\boldsymbol{\beta}} = \left(\mathbf{A}^\top \mathbf{W} \mathbf{A}\right)^{-1} \mathbf{A}^\top \mathbf{W} \mathbf{x}.$$

From $\hat{\boldsymbol{\beta}}$, we obtain the weighted least squares gradient and $x$-intercept of the straight lane based on the extracted lane points. To reduce false positive noise in our set of straight lane points, we then remove outliers, i.e. points that lie significantly further from the regression line than the other points, based on the $x$-distance between each point and the regression line. This step yields a more accurate final $\hat{\boldsymbol{\beta}}$ for the regression line, which we store as the straight lane marking parameters. Using these parameters, we can fix lane markings that are broken by undetected lane edges in the deep learning model probability map output.
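The fit and outlier removal can be sketched as below, assuming lanes are parameterized as `x = b0 + b1 * y` (x as a function of the image row y). `weighted_line_fit` expands the closed-form weighted least squares solution for the two-parameter case. The outlier rule in `fit_straight_lane` (residual above `factor` times the median residual, floored at `min_px` pixels) is a hypothetical stand-in, since the exact criterion is not given here, and both function names are illustrative.

```python
from statistics import median
from typing import List, Tuple


def weighted_line_fit(xs: List[float], ys: List[float],
                      ws: List[float]) -> Tuple[float, float]:
    """Closed-form weighted least squares for x = b0 + b1 * y."""
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    s_yy = sum(w * (y - y_bar) ** 2 for w, y in zip(ws, ys))
    s_xy = sum(w * (y - y_bar) * (x - x_bar) for w, x, y in zip(ws, xs, ys))
    b1 = s_xy / s_yy          # gradient
    b0 = x_bar - b1 * y_bar   # x-intercept
    return b0, b1


def fit_straight_lane(xs: List[float], ys: List[float], confs: List[float],
                      factor: float = 3.0, min_px: float = 1.0):
    """Fit, drop points far from the line in x, then refit on the inliers."""
    b0, b1 = weighted_line_fit(xs, ys, confs)
    resid = [abs(x - (b0 + b1 * y)) for x, y in zip(xs, ys)]
    cutoff = max(factor * median(resid), min_px)  # hypothetical outlier rule
    keep = [i for i, r in enumerate(resid) if r <= cutoff]
    b0, b1 = weighted_line_fit([xs[i] for i in keep], [ys[i] for i in keep],
                               [confs[i] for i in keep])
    return b0, b1, keep


# Ten points on x = 5 + 0.5y plus one low-confidence outlier at (40, 5).
ys = list(range(10)) + [5]
xs = [5 + 0.5 * y for y in range(10)] + [40.0]
confs = [1.0] * 10 + [0.3]
print(fit_straight_lane(xs, ys, confs))  # (5.0, 0.5, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
```

Weighting by confidence already damps the low-confidence outlier's pull on the first fit, which keeps its residual large relative to the inliers and makes it easy to remove before the refit.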
III-D Tracking Preceding Frames
In complicated driving environments with varying weather, illumination and road conditions, the current frame alone may be insufficient for accurate lane detection. The lanes in the current frame may be obscured or degraded by shadows, poor road conditions (e.g. stains, fading), or occlusions. To address this and minimize distortions arising from incorrectly identified lanes in the probability map, we track lanes in preceding frames and map them to lanes in the current frame to hypothesize stable and robust active lanes. We do this by calculating the root mean square (RMS) $x$-distance, $d$, between previous and current lane markings:

$$d(L_1, L_2) = \sqrt{\frac{1}{h}\sum_{y=1}^{h}\left(x_1(y) - x_2(y)\right)^2}, \qquad x_j(y) = \beta_{0,j} + \beta_{1,j}\,y,$$

where $L_1$ and $L_2$ are the two lane markings under consideration; $x_j(y)$, $\beta_{1,j}$ and $\beta_{0,j}$ are the $x$-coordinate (at row $y$), gradient and $x$-intercept of $L_j$ respectively, with $j \in \{1, 2\}$; and $h$ is the height of the probability map output. Lanes whose $d$ falls below a fixed threshold are considered to be the same lane. If more than one lane in the preceding frame matches the current lane, we match the current lane with the previous lane that has the smallest $d$, so that each current lane marking is matched to only one previous lane marking.
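The lane matching can be sketched as follows, assuming each straight lane is stored as `(b0, b1)` with `x = b0 + b1 * y` and that the RMS distance is averaged over every row `y = 1..h` of the probability map. The matching `threshold` is left to the caller, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Line = Tuple[float, float]  # (b0, b1) for x = b0 + b1 * y


def rms_x_distance(a: Line, b: Line, h: int) -> float:
    """RMS horizontal distance between two lines over rows y = 1..h."""
    total = 0.0
    for y in range(1, h + 1):
        total += ((a[0] + a[1] * y) - (b[0] + b[1] * y)) ** 2
    return math.sqrt(total / h)


def match_previous(current: Line, previous: List[Line], h: int,
                   threshold: float) -> Optional[int]:
    """Index of the closest previous lane with d < threshold, else None."""
    best_idx, best_d = None, threshold
    for i, prev in enumerate(previous):
        d = rms_x_distance(current, prev, h)
        if d < best_d:  # keep only the smallest qualifying distance
            best_idx, best_d = i, d
    return best_idx
```

Keeping only the smallest qualifying distance means a current lane can never be linked to two previous lanes at once, matching the one-to-one rule described above.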
We track lane markings and assign them weights based on their appearance in previous frames. This allows us to map lane markings onto the current frame even if the lanes are undetected in some intermediate frames, due to reasons such as fading, shadows or occlusions, while remaining robust to changes in the driving environment. Lane markings with a greater number of high confidence points and labelled as potential active lane markings in the current frame are given a higher weight increment, as they are more likely to form the active lane. Meanwhile, for lane markings that do not appear in the current frame, we decrease their weight exponentially to remain responsive to the changing driving environment. The weight of each lane marking is then calculated from its detection history.
In this calculation, the weight increment factor is higher for identified potential active lane markings and lower for non-active lane markings; the exponential decay depends on the number of frames in which the lane marking has been missing since it was detected; and the increment for each frame depends on the confidences and the number of lane points of the lane marking in that frame, taken over all previous and current frames. We identify potential active lane markings based on the deep learning model output and assign them a higher increment factor to prioritize them, while also recording inactive lane markings for processing in subsequent frames. We store inactive lane markings because a lane marking might be incorrectly identified as the active lane marking, e.g. due to false positive lane markings as shown in Fig. 1(c), while the true active lane marking might be misclassified as inactive. As inactive lane markings are lane markings identified in the current image, keeping a record of them and assigning them weights gives RONELD a better understanding of the current lane environment and keeps it robust to changes in the driving environment (e.g. lane changes by the vehicle).
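Because the exact weight formula is not reproduced here, the sketch below uses a hypothetical update rule that only mirrors the behaviour described: a detected lane gains `alpha * sum(confidences)`, with a larger `alpha` for potential active lane markings, and an undetected lane is multiplied by `decay` each frame, so after k missed frames its weight has shrunk by a factor of `decay ** k`. The class, parameter names and default values are illustrative assumptions, not RONELD's.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrackedLane:
    params: Tuple[float, float]  # (b0, b1) of the straight-line model
    weight: float = 0.0
    missing: int = 0  # consecutive frames without a detection


def update_weight(lane: TrackedLane, confidences: List[float],
                  is_active_candidate: bool, alpha_active: float = 2.0,
                  alpha_inactive: float = 1.0, decay: float = 0.5) -> float:
    """Hypothetical per-frame weight update for one tracked lane marking."""
    if confidences:
        # Detected: more (and more confident) points give a larger increment.
        alpha = alpha_active if is_active_candidate else alpha_inactive
        lane.weight += alpha * sum(confidences)
        lane.missing = 0
    else:
        # Missing: shrink geometrically, i.e. exponential decay over frames.
        lane.missing += 1
        lane.weight *= decay
    return lane.weight
```

Inactive lanes are updated with the same rule, only with the smaller increment factor, so a misclassified true active lane can still overtake a false positive over a few frames.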
We rank lane markings by weight after processing the current frame. On each of the left and right sides of the image, we take the lane marking with the highest weight and mark the two as the left and right lane markings of our active lane. Finally, we use the lane marking parameters and the camera's extrinsic parameters to plot our final lane marking output, using the linear model and quadratic splines from Section III-C for straight and curved lanes respectively.
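A sketch of the final selection, assuming each tracked lane is a `(b0, b1, weight)` tuple with `x = b0 + b1 * y`, and that a lane's side is decided by its x-position at the bottom row relative to the image centre (a hypothetical rule; the function name is also illustrative).

```python
from typing import List, Optional, Tuple

Lane = Tuple[float, float, float]  # (b0, b1, weight), x = b0 + b1 * y


def select_active_lane(lanes: List[Lane], h: int, w: int
                       ) -> Tuple[Optional[Lane], Optional[Lane]]:
    """Pick the highest-weight lane on each side of the image centre."""
    left: Optional[Lane] = None
    right: Optional[Lane] = None
    for lane in lanes:
        b0, b1, weight = lane
        x_bottom = b0 + b1 * (h - 1)  # x-position at the bottom row
        if x_bottom < w / 2:
            if left is None or weight > left[2]:
                left = lane
        elif right is None or weight > right[2]:
            right = lane
    return left, right
```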
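Table I: Dataset details

| Dataset | Total frames | Test frames | Resolution | Road type |
|---|---|---|---|---|
| CULane | 133,235 | 34,680 | 1640×590 | Urban, rural, highway |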
IV Experimental Results
We run experiments on the test sets of two popular and widely used lane detection datasets, TuSimple [TuSimple-2019] and CULane [Pan2018]. Table I summarizes their details and Fig. 4 contains sample frames from the datasets. CULane has ground truths labelled on all frames and contains many challenging driving scenarios (e.g. congested urban roads and night scenes with poor lighting conditions). TuSimple, on the other hand, is a relatively easy dataset, captured under good or medium weather conditions along highways during the daytime, and only has ground truths labelled on the last frame of each twenty-frame clip. For each frame with labelled ground truths, we manually select the lane markings demarcating the active lane for detection and comparison in our experiments. Some frames in CULane do not have lane markings (e.g. when crossing traffic light junctions); these are excluded from our experiments.
IV-B Evaluation metrics
Similar to [Pan2018, Hou2019], we calculate the intersection over union (IoU) between the ground truths and predicted lane markings, drawn as lines with fixed pixel widths for the ground truths and predictions respectively, to identify true positive predicted lane markings. The line widths are adjusted based on the model output width for uniform comparison across models. We record results for each IoU threshold between 0.3 and 0.5 (inclusive) at 0.01 intervals. Lane predictions with IoU values above an IoU threshold are marked as true positive (TP) lanes for that threshold level. As our evaluation metric, we employ

$$\text{Accuracy} = \frac{N_{TP}}{N_{GT}},$$

where $N_{TP}$ is the number of true positive lanes detected at each IoU threshold and $N_{GT}$ is the number of ground truth lanes. For uniform comparison, we use this same accuracy metric on both datasets.
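The accuracy metric can be computed per threshold as follows. This sketch assumes the IoU of each predicted lane with its ground truth has already been computed (rasterizing the lines at the chosen widths is omitted), and the function name is hypothetical.

```python
from typing import Dict, List


def accuracy_per_threshold(ious: List[float], n_gt: int) -> Dict[float, float]:
    """Accuracy = N_TP / N_GT for IoU thresholds 0.30, 0.31, ..., 0.50.

    `ious` holds the IoU of each predicted lane with its ground truth lane;
    a prediction is a true positive if its IoU is above the threshold.
    """
    acc = {}
    for t100 in range(30, 51):  # integer steps avoid float accumulation
        t = t100 / 100
        n_tp = sum(1 for iou in ious if iou > t)
        acc[t] = n_tp / n_gt
    return acc
```

Ground truth lanes with no prediction simply contribute nothing to $N_{TP}$ while still counting in $N_{GT}$, so missed lanes lower the accuracy at every threshold.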
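Table II: Accuracy on the CULane test set (Acc. Inc.: absolute increase, with relative increase in parentheses)

| IoU threshold | SCNN | SCNN + RONELD | Acc. Inc. | ENet-SAD | ENet-SAD + RONELD | Acc. Inc. |
|---|---|---|---|---|---|---|
| 0.3 | 0.812 | 0.826 | 0.014 (1.7%) | 0.823 | 0.832 | 0.009 (1.1%) |
| 0.4 | 0.762 | 0.789 | 0.027 (3.5%) | 0.778 | 0.799 | 0.021 (2.7%) |
| 0.5 | 0.629 | 0.703 | 0.074 (11.8%) | 0.655 | 0.729 | 0.074 (11.3%) |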
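Table III: Accuracy on the TuSimple test set (Acc. Inc.: absolute increase, with relative increase in parentheses)

| IoU threshold | SCNN | SCNN + RONELD | Acc. Inc. | ENet-SAD | ENet-SAD + RONELD | Acc. Inc. |
|---|---|---|---|---|---|---|
| 0.3 | 0.625 | 0.869 | 0.244 (39.0%) | 0.608 | 0.825 | 0.217 (35.7%) |
| 0.4 | 0.470 | 0.796 | 0.326 (69.4%) | 0.502 | 0.753 | 0.251 (50.0%) |
| 0.5 | 0.238 | 0.549 | 0.311 (130.7%) | 0.341 | 0.530 | 0.189 (55.4%) |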
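The Acc. Inc. columns report the absolute gain followed by the gain relative to the baseline accuracy. A small helper (hypothetical name) reproduces, for example, the SCNN entry at the 0.5 IoU threshold on TuSimple:

```python
def accuracy_increase(base: float, enhanced: float) -> tuple:
    """Absolute and relative (%) accuracy increase, rounded as in the tables."""
    gain = enhanced - base
    return round(gain, 3), round(100 * gain / base, 1)


# TuSimple, 0.5 IoU threshold, SCNN vs SCNN + RONELD (values from the table)
print(accuracy_increase(0.238, 0.549))  # (0.311, 130.7)
```

A 130.7% relative increase means the accuracy is about 2.3 times the baseline, i.e. more than doubled.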
IV-C Implementation details
We use two state-of-the-art methods, namely SCNN [Pan2018] and ENet-SAD [Hou2019], for comparison with our RONELD method. The models are pre-trained on the CULane train set, and we deliberately exclude all TuSimple images from the train set for cross-dataset validation. We use the CULane-trained SCNN and ENet-SAD models to generate probability map outputs on the CULane and TuSimple test set images. From these probability maps, we generate lane markings for the SCNN and ENet-SAD models using the method outlined in [Pan2018] and [Hou2019]: it searches every twenty rows in the probability map, selects the highest confidence point in each as a lane point, and connects these points using cubic splines to obtain lane marking predictions. To obtain lane marking predictions for SCNN + RONELD and ENet-SAD + RONELD, we run RONELD on the same probability map outputs from the CULane-trained SCNN and ENet-SAD models respectively, on the CULane and TuSimple test sets. We compare the lane marking predictions with the ground truths and compute the corresponding accuracy at the different IoU thresholds for each method.
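For reference, the point sampling of this baseline post-processing can be sketched as below. The bottom-up starting row is an assumption, the function name is hypothetical, and the cubic-spline step that joins the points is omitted.

```python
from typing import List, Tuple


def sample_baseline_points(prob: List[List[float]], step: int = 20
                           ) -> List[Tuple[int, int, float]]:
    """Take the most confident column in every `step`-th row of one lane's map.

    Returns (row, col, confidence) triples for the spline fitting stage.
    """
    points = []
    for r in range(len(prob) - 1, -1, -step):  # assumption: start at bottom row
        row = prob[r]
        c = max(range(len(row)), key=row.__getitem__)
        points.append((r, c, row[c]))
    return points
```

Unlike the search window in Section III-A, this baseline takes the row maximum wherever it lies, so a noisy pixel far from the lane can be picked as a lane point.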
Tables II-III and Fig. 5 summarize the accuracy of our methods, i.e. SCNN + RONELD and ENet-SAD + RONELD, against SCNN and ENet-SAD respectively, on the CULane and TuSimple test sets. Tables II-III show accuracy results and the percentage increase in accuracy at the 0.3, 0.4 and 0.5 IoU thresholds, which correspond to loose, medium and strict evaluations respectively. Fig. 5 shows accuracy results for IoU thresholds from 0.3 to 0.5 (inclusive) at 0.01 intervals. Some comparative qualitative results are shown in Fig. 6, and we discuss the results below.
CULane results. Improvements from our RONELD method are modest on this dataset, except for the 11.8% and 11.3% increases in accuracy at the strictest IoU threshold for SCNN and ENet-SAD respectively. The high degree of similarity between the CULane test and train sets, which are from the same city, explains the generally good performance of both deep learning models, with and without RONELD, on the CULane test set, leaving less room for RONELD to improve on the lane detection outputs of the existing models. Furthermore, errors in the lane detection outputs are mostly due to incorrectly identified lanes in the deep learning model probability map output, similar to Fig. 1(c), which are difficult for RONELD to correct while remaining robust to changes in the lane markings from the probability map output. Despite this, our experiments show that adding RONELD increases the accuracy of the SCNN and ENet-SAD models at all IoU thresholds tested.
TuSimple results. RONELD yields a more significant accuracy improvement on this dataset. The state-of-the-art algorithms clearly do not work well on the unseen TuSimple test set in cross-dataset validation tests, particularly at higher IoU thresholds. By applying RONELD, we achieve compelling results, with a 35.7% to 69.4% increase in accuracy at the looser 0.3 and 0.4 IoU thresholds. More significantly, at the strictest 0.5 IoU threshold, accuracy increases by more than 50% for ENet-SAD (55.4%) and more than doubles for SCNN (130.7%). Furthermore, common issues with the SCNN and ENet-SAD probability map outputs on the unseen TuSimple dataset appear to be undetected lanes in intermediate frames and a high degree of noise causing distorted lanes. These arise because the TuSimple test set differs in some significant ways from the CULane data the models were trained on (e.g. road surface conditions, types of lane markings). To address these issues, RONELD applies linear regression with outlier removal to detected straight lanes to reduce noise, and, by tracking preceding frames, maps lanes from previous frames onto the current frame to address undetected lanes in intermediate frames. As a result, RONELD achieves a more significant accuracy increase on the TuSimple dataset, with more stable active lanes that are less susceptible to noise. These lanes are more suitable for autonomous driving applications than the lanes detected by the deep learning models alone, as shown in Fig. 6(i), (j), (k), (l).
Discussion. Adding RONELD to the deep learning models improves accuracy at all IoU thresholds measured. Interestingly, at the looser 0.3 and 0.4 IoU thresholds, SCNN + RONELD achieves higher accuracy on the unseen TuSimple test set than on the CULane test set, while ENet-SAD + RONELD achieves comparable accuracy at the 0.3 IoU threshold on both test sets, despite both SCNN and ENet-SAD being trained on the CULane train set. This can be explained by TuSimple being a relatively simple dataset compared to CULane, and by each clip in the TuSimple test set containing nineteen preceding frames for every labelled image, since only the last frame in each twenty-frame clip has ground truth lane markings. This provides RONELD with preceding frames to process before each comparison with ground truth, and highlights RONELD's ability to effectively use lane information from preceding frames.
Ablation study. To verify the effectiveness of our RONELD method, we conducted an ablation study on the effect of the preceding frame tracking (PFT) step described in Section III-D. We run RONELD with and without PFT by controlling the information passed between successive RONELD calls. Our results are shown in Table IV. The increase due to PFT is significantly larger at the 0.3 IoU threshold than at the 0.5 IoU threshold. This is expected, as lanes carried over from preceding frames are less likely to have the high accuracy needed to meet the stricter threshold, but still provide a good estimate of the current lane position. The increase in accuracy is also more significant on the TuSimple test set than on the CULane test set, which is in line with expectations given the greater number of undetected lanes on TuSimple in this cross-dataset validation with the CULane-trained models.
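Table IV: Ablation study of preceding frame tracking (PFT)

| Dataset | SCNN + RONELD (w/o PFT) | SCNN + RONELD (w/ PFT) | ENet-SAD + RONELD (w/o PFT) | ENet-SAD + RONELD (w/ PFT) |
|---|---|---|---|---|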
Runtime. To verify that RONELD is fast enough for real-time use on autonomous vehicles, we measured its average runtime on a single Intel Core i9-9900K CPU, taking the mean time RONELD needs to process a probability map from the deep learning models and return lane markings, across all images in the test sets, including test set images without labelled ground truths. Table V records the average runtimes of our Python 3 + Numba [Lam2015] implementation. The average runtime on the TuSimple test set is noticeably lower than on the CULane test set. This is because the CULane-trained models detect fewer lane points per image on the TuSimple test set: TuSimple is unseen by the models, which are therefore less able to detect its lane markings, as reflected by the significantly weaker performance of the deep learning models on the TuSimple test set, particularly at the strict 0.5 IoU threshold, both with and without RONELD. Fewer detected points per image require less processing by RONELD, resulting in shorter runtimes on the TuSimple test set. The difference in average runtimes between the ENet-SAD and SCNN models can be explained by the larger size of the ENet-SAD probability map. Overall, RONELD is a low computational time method suitable for pairing with deep learning models for real-time use on autonomous vehicles, with mean average runtimes of less than 5 ms on both deep learning models tested.
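The runtime measurement amounts to averaging per-input wall-clock times. A minimal sketch using Python's `time.perf_counter`; the function name is hypothetical, and the `process` callable is a placeholder for RONELD's entry point, which is not shown here.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def mean_runtime_ms(process: Callable, inputs: Iterable) -> float:
    """Mean wall-clock time per input, in milliseconds."""
    times = []
    for item in inputs:
        start = time.perf_counter()
        process(item)  # e.g. one probability map in, lane markings out
        times.append((time.perf_counter() - start) * 1000.0)
    return mean(times)
```

When timing a Numba-compiled function this way, the first call includes JIT compilation, so a warm-up call before measuring avoids inflating the mean.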
We have presented a robust neural network output enhancement for active lane detection (RONELD) method, which achieves compelling results in cross-dataset validation tests and shows high potential for use in real-time autonomous driving applications. Using RONELD, we identify, track and optimize active lane detection on probability maps from deep learning based lane detection algorithms. We have demonstrated it on two state-of-the-art algorithms, tagged as SCNN + RONELD and ENet-SAD + RONELD, in our experiments on the CULane and TuSimple datasets. Results over the two datasets indicate that applying RONELD increases accuracy over the SCNN and ENet-SAD baselines by up to 69.4% at the looser 0.3 and 0.4 IoU thresholds, and by up to 130.7% (more than two-fold) at the strictest 0.5 IoU threshold.