In an autonomous driving system (ADS), lane detection plays an important role. On the one hand, the location of the host vehicle and other traffic participants within the lane forms the basis of autonomous driving decisions. On the other hand, the geometry of a lane marker can be viewed as an important landmark of the environment and aligned with a high-resolution or vector map for high-precision positioning. At the same time, lane detection has been widely used in Advanced Driver Assistance Systems (ADAS) and is the basis for common features such as Lane Keep Assist (LKA) and Adaptive Cruise Control (ACC).
Recent advances in lane detection can be attributed to the development of convolutional neural networks (CNN). Most existing methods adopt well-studied frameworks such as semantic segmentation and object detection to parse lane markers, and transform the network output into parametric curves through post-processing. However, these commonly used frameworks cannot be seamlessly generalized to curved lane lines, because lane detection requires precise representation of local positions and global shapes simultaneously, and each framework shows its own limitations.
The semantic segmentation-based approach predicts binary masks of lane marker regions, inserts clustering models into training and inference, groups masked pixels into individual instances, and finally uses curve fitting to obtain parametric results. However, the clustering procedure complicates the training and inference pipeline. In addition, pixel-level inputs to curve fitting are often redundant and noisy. Both factors degrade the accuracy of the final results. Fig.1(a) shows several cases where the prediction errors may increase. Object detection approaches were originally designed for compact targets and produce bounding boxes as output, which are insensitive to pixel-level errors on large-scale objects. Lane markers, however, typically span half or more of the image, and pixel-level localization errors significantly impair detection performance. This can be attributed to the limited field of view (FOV) of features learned by a CNN, which is insufficient to model content that is far apart. Fig.1(b) illustrates the effect of FOV in a complex scenario. Moreover, most of these solutions model global geometry directly, so the network must produce high-dimensional outputs to describe the curves. In theory, such non-compact outputs increase the demand for data and model capacity, ultimately limiting the generalization ability of the resulting model.
Although the global structure of lane markers has some complexity, we note that local lane markers are extremely simple and that global lane markers can be approximated by a combination of local line segments. Moreover, spatial locality is well suited to modeling with CNN. Following this intuition, we propose a novel lane marker detection method, FOLOLane, which focuses on modeling local geometry and integrates it into the global result in a bottom-up manner. Specifically, the geometry of a lane marker is predicted by estimating adjacent keypoints on it. In the bottom stage, a fully convolutional network is used to capture keypoints in the local scope through two separate heads. The first head gives the probability that keypoints appear in pixel space, and the second gives the offset between keypoints and the most spatially correlated local lane marker, which is used to refine the positions of keypoints generated by the first head and to construct associations between keypoints on the same lane marker. Based on this local information, two decoding algorithms with different preferences are proposed to predict the global geometry of lane markers. The bottom-up pipeline of the proposed method is shown in Fig.2.
Compared with existing works [chen2019pointlanenet, CurveLane-NAS, qin2020ultra, neven2018towards, pan2017spatial], the proposed approach concentrates the capabilities of CNN on a local scale, which suits CNN's limited FOV and significantly reduces the complexity of the task and the dimension of the output. As a result, the compact output leads to stable and efficient training without additional effort in network architecture design and data collection. Considering the continuity of lane markers, the proposed decoder is able to associate keypoints of the same instance and optimize the geometry of network predictions without affecting performance and efficiency. Furthermore, during network training and instance decoding, we model and predict keypoints using the features with the highest spatial correlation, guided by coarse-to-fine strategies. The proposed bottom-up solution achieves state-of-the-art results (accuracy of 96.92% on TuSimple and F1 score of 78.8% on CULane) and excellent generalization across the two public datasets. Together with its compatibility with different network architectures, our approach shows promise for practical application.
We emphasize that our method is the first to formulate lane detection as a multi-keypoint estimation and association problem, inspired by the bottom-up human pose estimation framework [pishchulin2016deepcut, cao2017realtime, efrat2020semi]. The proposed local-scope-based method avoids the inaccurate predictions far from the anchor that occur in detection-based methods. The sparsity of key points also prevents the noisy and redundant output of segmentation-based methods, which decreases the precision and increases the latency of curve fitting. Through extensive experiments, our solution demonstrates the potential of applying pose estimation approaches to lane detection, which opens up a new direction for this important application problem. Our solution does not depend on a particular CNN architecture, is readily compatible with newly developed architectures, and shows scalable potential in accuracy and efficiency.
Our contributions can be summarized as follows:
Lane detection is, for the first time, decomposed into subtasks of modeling local geometry, which is achieved by estimating keypoints on the local curve. The simplified targets and the focus on a spatially limited scope help the network provide a precise estimation of the local curve.
Two decoding algorithms with different preferences are designed to integrate local information into the global prediction, which enables the system to achieve high accuracy at ultra-real-time speed.
Experimental results show that our approach outperforms all existing methods by a substantial margin. In addition, our model shows the best generalization ability in the comparison, which further supports its potential for productization.
2 Related Work
Lane Marker Detection.
Lane marker detection based on deep learning can be categorized into two groups: detection based and segmentation based. Detection based: [chen2019pointlanenet] proposed an anchor-based lane marker detection model for forward-looking cameras. Lane markers were uniformly sampled along the vertical axis of the image, and dense regression was performed by predicting the offset between each sample point and an anchor line; Non-Maximum Suppression (NMS) was then applied to suppress overlapping detections and select the lane marker with the highest score. [CurveLane-NAS] proposed the use of neural architecture search (NAS) to find a better backbone, together with point-blending-based post-processing, to further improve lane marker detection. [ko2020key] proposed to train a CNN to predict the existence, position and feature embedding of lane markers in an image. Lane marker instances were clustered based on the trained feature embedding. [qin2020ultra] formulated lane marker detection as a classification problem for each row of an image. A specific feature map was predicted to indicate the position of a lane marker on each row.
Segmentation based: [lee2017vpgnet] proposed a multitask framework, which predicted pixel-wise multi-labels and clustered the pixels belonging to the same lane instance in a bird's-eye-view image using DBSCAN. It also added an auxiliary task, vanishing point estimation, to increase the stability of lane marker detection. [neven2018towards] proposed an end-to-end joint semantic segmentation and feature embedding network architecture. Pixels on the same lane marker were assigned an identical instance id. [pan2017spatial] also designed an instance segmentation network for the lane marker detection problem. Different from [neven2018towards], [pan2017spatial] predicted a probability map for each lane marker separately and used cubic splines to fit it. Instead of using pixel-wise classification, [yoo2020end] introduced a row-wise classification architecture. For each row, it predicted the most probable grid cell of a lane marker in an image and recovered a lane marker instance through post-processing. [liu2020lane] proposed a CycleGAN-based method to enhance lane detection performance in low light conditions. [ghafoorian2018gan] claimed a more accurate method by using EL-GAN for lane marker detection, which used a generator to segment the lane markers and a discriminator to refine the segmentation result. [hou2019learning] proposed a self-attention distillation method for the lane marker segmentation task by forcing shallow layers to learn rich context features from deep layers.
Bottom-Up Human Key Point Detection. [pishchulin2016deepcut] proposed a bottom-up method for crowded scenes, which detected keypoints and built a densely connected graph in which the weight of each edge represented the correlation of two keypoints. By optimizing the graph, keypoints belonging to one person were clustered. [cao2017realtime] predicted a heat map for each keypoint and part affinity fields (PAFs), which were used to associate body parts with individuals in the image. Similar to [efrat2020semi, neven2018towards], [newell2017associative] introduced feature embedding to facilitate clustering the keypoints of one person while predicting the heat map of keypoints. [papandreou2018personlab] further split the problem into two stages: (1) predicting heat maps and short-range offsets for keypoint detection, and (2) clustering the key points of one person using mid-range offsets.
We find that lane marker detection can be abstracted as a discrete keypoint detection and association problem, which is very similar to the bottom-up human key point detection task. [philion2019fastdraw] proposed a method based on this idea. A network was trained to extract all possible lane marker pixels and, for each, to output the pixels in the neighboring rows that belong to the same lane as the current lane marker pixel. As discussed above, the inherent segmentation-based formulation inhibited the precise representation of a lane marker. In addition, the pixel-wise joint distribution prediction was redundant.
As shown in Fig.2, we propose a bottom-up lane detection method that estimates the existence and the offsets of local lane points through a network, followed by a novel global geometry decoder that generates the final curve instances.
3.1 Network for local geometry
In the proposed approach, each predicted marker curve is represented as an ordered set of keypoints, which lie at a fixed, predefined vertical interval across neighboring rows. First, the task of curve prediction is decomposed into local subtasks via a fully convolutional network with two heads. The heatmap output by the first head expresses the probability that a keypoint appears, which resolves the existence of local curves. The second head predicts offsets to the key points of the closest local curve, which describes the precise geometry of the local curve.
Key point estimation. Motivated by the fact that a curve is constituted of points, we adopt a keypoint-estimation-based framework. The network first outputs a heatmap with the same resolution as the input, which models the probability that a pixel is a keypoint of the curve. In the training phase, the point set annotating the curve is interpolated to be continuous in pixel space. Each pixel of the curve is considered as a key point and yields ground-truth values for its neighbors via an unnormalized Gaussian kernel. The standard deviation depends on the scale of the input, and if the ground-truth value of a pixel is assigned by multiple keypoints, the maximum is kept.
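The ground-truth construction described above can be illustrated with a minimal sketch. The function name `gaussian_heatmap`, the truncation of the kernel at three standard deviations, and the pure-Python list representation are assumptions made for illustration and are not taken from the paper.

```python
import math


def gaussian_heatmap(height, width, keypoints, sigma):
    """Build a ground-truth heatmap from integer (x, y) key points.

    Each key point spreads an unnormalized Gaussian (peak value 1.0).
    Where several key points reach the same pixel, the maximum is kept.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    radius = int(3 * sigma)  # assumption: truncate the kernel at 3 sigma
    for kx, ky in keypoints:
        for y in range(max(0, ky - radius), min(height, ky + radius + 1)):
            for x in range(max(0, kx - radius), min(width, kx + radius + 1)):
                d2 = (x - kx) ** 2 + (y - ky) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap
```

Pixels exactly on the curve receive the value 1 and are the only positives for the loss. The surrounding values between 0 and 1 mark the ambiguous neighborhood.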
To deal with the class imbalance problem that comes with the sparsity of key points, we employ the penalty-reduced focal loss for this head as in [law2018cornernet, zhou2019object], where only pixels with ground truth equal to 1 are considered positive and all others are negative. The penalty from negative pixels rises with the distance to the positive ones, which helps to reduce the influence of ambiguity. Two penalty coefficients are defined from the heatmap output at each pixel and the ground-truth value assigned to it by the Gaussian kernel, and the loss function for the heatmap head is constructed from them. Two tunable hyperparameters control the penalty reduction for ambiguous and simple samples respectively, and the formulation also involves the number of key points in the current image.
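As a rough illustration, the sketch below implements the penalty-reduced focal loss in the form commonly used by the cited CornerNet-style detectors. It is not necessarily the exact formulation of this paper: the names `alpha` and `beta`, their default values, the clamping constant `eps`, and the normalization by the number of positive pixels are all assumptions.

```python
import math


def penalty_reduced_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss over 2-D lists `pred` and `target`.

    Only pixels with target == 1 are positive. For negatives, the factor
    (1 - y) ** beta shrinks the penalty near key points (ambiguous samples),
    while the exponent alpha down-weights easy samples.
    """
    loss = 0.0
    num_pos = 0
    for pred_row, target_row in zip(pred, target):
        for p, y in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            if y == 1.0:
                loss -= (1.0 - p) ** alpha * math.log(p)
                num_pos += 1
            else:
                loss -= (1.0 - y) ** beta * p ** alpha * math.log(1.0 - p)
    return loss / max(num_pos, 1)
```

With these assumed defaults, a negative pixel whose ground-truth value is close to 1 contributes almost nothing, which matches the stated goal of reducing the influence of ambiguous pixels around the lane center.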
Compared with segmentation-based methods, the loss function Eq.2 guides the network to learn positive and negative keypoint samples with reduced supervision from the totality of pixels, encouraging the pixels best suited for expressing geometry to respond. An example of the heatmap can be found in Fig.2 as the first output of the network: the center of the lane marker responds most strongly and the response of its neighborhood decays gradually, which also helps prevent noise and redundancy from propagating to subsequent procedures.
Local geometry construction. For precise geometry, the second head of the network regresses a vector for each pixel, describing the local geometry of the curve closest to that pixel. The elements of the vector indicate the horizontal offsets to 3 neighboring key points at a fixed vertical interval, which have been colorized for visualization in Fig.2. Given the vector, we can simply recover the local curve related to the pixel as the three actual key point locations at the fixed vertical interval relative to it.
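A minimal sketch of this recovery step follows, under several assumptions: the three offsets are ordered (row above, same row, row below), image rows increase downward, and the name `recover_local_keypoints` is hypothetical.

```python
def recover_local_keypoints(px, py, offsets, delta_y):
    """Recover three key points of the local curve seen from pixel (px, py).

    `offsets` holds the horizontal offsets to the key points on the row
    above, the same row, and the row below (assumed order); `delta_y` is
    the fixed vertical interval between neighboring key points.
    """
    dx_up, dx_mid, dx_down = offsets
    return [
        (px + dx_up, py - delta_y),    # neighbor one interval above
        (px + dx_mid, py),             # refined position on the same row
        (px + dx_down, py + delta_y),  # neighbor one interval below
    ]
```

For example, `recover_local_keypoints(100, 50, (-2.0, 0.5, 3.0), 10)` places the three points on rows 40, 50 and 60.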
In the training phase, all pixels within a fixed distance from key points of the curve , , are taken to compute loss for .
where denotes the function retrieving vertical coordinate of the pixel, is function retrieving horizontal coordinate of curve on specific row .
For , a coarse-to-fine strategy is employed:
where the training pixels come from the decoded prediction of and in Eq.4, which is used to compensate for the error in predicting and and keeps in line with the coarse-to-fine behavior in the decoding stage. L1 loss is employed for all the regression terms.
Network architecture. To justify the effectiveness of focusing on local geometry, we adopt the light-weight architectures ERFNet [romera2017erfnet] and BiSeNet [yu2018bisenet], which were originally designed for semantic segmentation on mobile devices. During feature extraction, the encoder abstracts the image into a downsampled feature map, then the decoder broadcasts the high-level semantics back to the same resolution as the input. All 4 logits are yielded by the last block of the decoder to save memory. Most experiments in this paper are based on ERFNet. Since the method is designed to work in real traffic scenarios, where it must handle merged or split markers and any number of instances, there is no extra branch specialized for predefined lane markers as in [pan2017spatial, hou2019learning]. The final cost function is formulated as
3.2 Decoder for global geometry
In the above section, the CNN produces a pixel-wise heatmap and offsets for keypoints in the local scope. This local information is subsequently integrated into the prediction of the global curve. Specifically, the heatmap is used to determine the emergence and termination of a curve. The offsets are used to associate keypoints on the same curve instance and to refine the geometry further. To this end, we propose two novel and simple algorithms for decoding the output of the CNN under different demand scenarios, which respond to preferences for accuracy and efficiency respectively.
Greedy decoder works by iteratively extending the neighbors of a keypoint in a greedy search-like manner. For each input image:
1. Find the row containing the greatest number of local maximum responses on the heatmap. This row and its points are taken as the starting line and the current keypoints.
2. Refine the position of the current keypoints with the offset predicted at each point.
3. Explore the vertical neighbors of the current points, whose coordinates are computed from the predicted offsets and the fixed vertical interval.
4. Examine the heatmap values of the two neighbors. If a value reaches the threshold, the corresponding neighboring point is used to update the current keypoint, and Steps 2 to 4 are repeated. Otherwise the search is terminated, and all the points searched from a single starting point are taken as one global curve.
To sum up, the decoding algorithm gradually extends the global curve by exploring the neighbors of each keypoint, and refines the geometry of the curve in a coarse-to-fine manner. This algorithm can produce a precise curve geometry, but its low efficiency limits its usability in practical applications. The process is shown in color in Fig.4.
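The four steps of the greedy decoder can be sketched as below. This is an illustrative reading of the description, not the authors' implementation. It assumes that `offsets[y][x]` stores three horizontal offsets in the order (row above, same row, row below), that local maxima are taken along each row in 1-D, and that positions are rounded to the pixel grid. All function names are hypothetical.

```python
def local_maxima(row, threshold):
    """Columns whose response reaches the threshold and is a 1-D local maximum."""
    cols = []
    for x, v in enumerate(row):
        left = row[x - 1] if x > 0 else float("-inf")
        right = row[x + 1] if x + 1 < len(row) else float("-inf")
        if v >= threshold and v > left and v >= right:
            cols.append(x)
    return cols


def greedy_decode(heatmap, offsets, delta_y, threshold):
    height, width = len(heatmap), len(heatmap[0])

    def refine(x, y):
        # Step 2: move x onto the curve using the same-row offset.
        rx = int(round(x + offsets[y][x][1]))
        return min(max(rx, 0), width - 1)

    def walk(x, y, direction):
        # Steps 3 and 4, repeated in one direction (-1 = up, +1 = down).
        points = []
        idx = 0 if direction < 0 else 2
        while True:
            nx = int(round(x + offsets[y][x][idx]))
            ny = y + direction * delta_y
            outside = not (0 <= nx < width and 0 <= ny < height)
            if outside or heatmap[ny][nx] < threshold:
                return points  # the curve terminates here
            x, y = refine(nx, ny), ny
            points.append((x, y))

    # Step 1: start from the row with the most local maxima.
    start_y = max(range(height),
                  key=lambda r: len(local_maxima(heatmap[r], threshold)))
    curves = []
    for x0 in local_maxima(heatmap[start_y], threshold):
        x0 = refine(x0, start_y)
        up = walk(x0, start_y, -1)
        down = walk(x0, start_y, +1)
        curves.append(up[::-1] + [(x0, start_y)] + down)
    return curves
```

Each curve is grown one key point at a time, which is why the procedure is sequential and comparatively slow.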
Efficient decoder is proposed to solve the inefficiency problem of the greedy decoder by utilizing the parallelism of computing devices. For each image:
1. Extract rows at equal intervals on the heatmap. On each row, take the points with local maximum response as the current keypoints.
2. For each keypoint, compute three related points from its predicted offsets.
3. Construct associations among the current keypoints located in neighboring rows. For a point in a given extracted row, the two points in the extracted rows immediately above and below that are closest to its predicted neighbor positions, respectively, are associated with it.
4. Start with the row that has the maximum number of current keypoints. According to the association relationship created in Step 3, for each current keypoint, all the keypoints associated with it in the rows above and below are iteratively taken out as a single group. Each keypoint group is considered as a global curve, and the refined positions of its points are used to refine the geometry of the curve further.
The efficient decoding algorithm leverages the parallel computing power of the device to create associations among keypoints and refine their positions in Steps 1 to 3. Step 4 involves only index operations, so its time overhead is very low. The process is shown in color in Fig.3.
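The efficient decoder can be sketched in the same style. The code below runs sequentially in pure Python and only marks where the per-keypoint work is independent. It assumes the same `offsets[y][x]` layout of (row above, same row, row below), and it links each key point to the nearest key point in the adjacent sampled row without any distance limit, which is a simplification. All names are hypothetical.

```python
def local_maxima(row, threshold):
    """Columns whose response reaches the threshold and is a 1-D local maximum."""
    cols = []
    for x, v in enumerate(row):
        left = row[x - 1] if x > 0 else float("-inf")
        right = row[x + 1] if x + 1 < len(row) else float("-inf")
        if v >= threshold and v > left and v >= right:
            cols.append(x)
    return cols


def nearest_index(target_x, candidates):
    """Index of the candidate closest to target_x, or None if there is none."""
    if not candidates:
        return None
    return min(range(len(candidates)),
               key=lambda k: abs(candidates[k] - target_x))


def efficient_decode(heatmap, offsets, delta_y, threshold):
    # Step 1: sample rows at equal intervals and take their local maxima.
    rows = list(range(0, len(heatmap), delta_y))
    keypoints = [local_maxima(heatmap[y], threshold) for y in rows]

    # Step 2: three related x positions per key point
    # (predicted upper neighbor, refined own position, predicted lower neighbor).
    related = [[tuple(x + d for d in offsets[y][x]) for x in xs]
               for y, xs in zip(rows, keypoints)]

    # Step 3: link each key point to the closest key point in the sampled
    # rows directly above and below. Every link is independent of the others,
    # so this part could be computed in parallel.
    up, down = [], []
    for i, rel in enumerate(related):
        above = keypoints[i - 1] if i > 0 else []
        below = keypoints[i + 1] if i + 1 < len(rows) else []
        up.append([nearest_index(r[0], above) for r in rel])
        down.append([nearest_index(r[2], below) for r in rel])

    # Step 4: start from the row with the most key points and follow links.
    start = max(range(len(rows)), key=lambda i: len(keypoints[i]))
    curves = []
    for k0 in range(len(keypoints[start])):
        group = [(start, k0)]
        i, k = start, k0
        while up[i][k] is not None:
            i, k = i - 1, up[i][k]
            group.insert(0, (i, k))
        i, k = start, k0
        while down[i][k] is not None:
            i, k = i + 1, down[i][k]
            group.append((i, k))
        # Use the refined same-row position as the final x coordinate.
        curves.append([(related[i][k][1], rows[i]) for i, k in group])
    return curves
```

Compared with the greedy sketch, no key point depends on a previously decoded one until the final grouping, which consists only of index lookups.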
In this section, we first describe the implementation details and evaluation datasets, followed by a quantitative and qualitative comparison with the state of the art. Finally, the ablation study and the generalization experiments are discussed.
|Dataset||# Frame||Train||Validation||Test||Resolution||Road type||# Lane|
|CULane||133235||88880||9675||34680||1640×590||urban, rural and highway||4|
4.1 Implementation Details
We first resized the width of each image to 976 while keeping the aspect ratio on both datasets. The vertical interval between key points was set to 10 pixels as a trade-off between precision and efficiency. The weight in the loss function of Eq.6 was set to 0.02. For optimization, we used the Adam optimizer and a poly learning rate schedule with an initial learning rate of 0.001. Each mini-batch contained 16 images per GPU, and we trained the model using 8 V100 GPUs for 40 epochs on CULane and 200 epochs on TuSimple, respectively. To reduce overfitting, we used a dropout probability of 0.3 and a weight decay of 0.0001. Furthermore, we also applied data augmentation, including random scaling, cropping, horizontal flipping, random rotation, and color jittering, which have been proven effective. In the testing phase, we set the threshold of lane existence confidence to 0.5.
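For readers unfamiliar with the poly schedule, the sketch below shows its usual form. The exponent `power=0.9` is a common default and an assumption here, since the paper does not state it. The function name is hypothetical.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial learning rate decay from base_lr at step 0 to 0 at max_steps."""
    progress = min(step, max_steps) / max_steps
    return base_lr * (1.0 - progress) ** power
```

With `base_lr=0.001` as in the text, the rate starts at 0.001 and decays smoothly to zero at the last step.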
Table 1 details the basic information of the TuSimple and CULane datasets. For evaluation criteria, we follow the official metrics used in [tusimple] and [pan2017spatial].
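As a reminder of how these two metrics are usually computed, the sketch below gives the F1 measure from true positive, false positive and false negative lane counts (the CULane convention) and the point-wise accuracy aggregated over clips (the TuSimple convention). The matching of predictions to ground truth, which produces these counts, is omitted. The function names are hypothetical.

```python
def f1_measure(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def tusimple_accuracy(correct_points, gt_points):
    """Total correctly predicted points divided by total ground-truth points.

    Both arguments are per-clip lists of point counts.
    """
    return sum(correct_points) / sum(gt_points)
```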
In this section, we show the results on two lane detection datasets. In all experiments, ERFNet [romera2017erfnet] is used as our baseline network unless otherwise specified.
Quantitative results. To verify the effectiveness of our proposed method, we compared it with state-of-the-art algorithms based on either segmentation or object detection, including SCNN[pan2017spatial], LaneNet(+H-Net)[neven2018towards], EL-GAN[ghafoorian2018gan], PointLaneNet[chen2019pointlanenet], FastDraw[philion2019fastdraw], ENet-SAD[hou2019learning], ERFNet-E2E[yoo2020end], SIM-CycleGAN+ERFNet[liu2020lane], UFNet[qin2020ultra] and PINet[ko2020key].
As illustrated in Table 2, the proposed method achieves a new SOTA result on the CULane testing set with a 78.8 F1 measure. Compared with PINet(4H), the best model to our knowledge, our method is better in almost all scenarios and improves the overall F1 measure by 4.4%. Because of local occlusions and fogged traffic lines, PINet shows degraded performance in some categories, such as Crowded, Arrow and Curve. Although our method and PINet are both based on key point estimation, in these categories our method outperforms PINet by 5.5%, 5.3%, and 3.8% F1 measure respectively, which indicates that our local geometry modeling and bottom-up pipeline have better lane marker representation capability. Interestingly, SIM-CycleGAN+ERFNet, which targets low light conditions using CycleGAN, does not match our lane marker detection model in the night and dazzle light scenarios, which suggests that our approach generalizes better even than training with GAN-augmented data.
The results of different methods on the TuSimple testing set are shown in Table 3. Due to the limited scale (train/test: 3.3k/2.8k) and the homogeneous scenario (highway), most methods achieve near-saturated accuracy (more than 96%). Despite this, our method still outperforms the second-best method by 0.17%, a margin close to the difference between the second- and fourth-best methods.
Qualitative results. We also show qualitative results of the proposed method and of SCNN, SIM-CycleGAN+ERFNet, UFNet and PINet on the CULane testing set. As shown in Fig.5, the focus on local geometry and the bottom-up strategy help our method cope with occlusion on crowded roads and with missing lane marker clues. Through keypoint estimation, the proposed method yields smoother and more accurate curves than the other methods. Even in night and dazzle light scenarios, the predicted results are still satisfactory. In conclusion, the proposed method leads to visible improvements in lane marker detection over recently developed segmentation-based and regression-based approaches.
4.3 Ablation Study
To investigate the effects of the locally based designs, an ablation study is carried out on the CULane dataset. The experiments are all conducted with the same settings as described in Sec. 4.1 if not specially mentioned.
|Se.||Ke.||@ test||@ train||Gre.||Eff.||ERF||BiSe|
Key point estimation. Different from segmentation-based solutions, our key point estimation method focuses on the center of the lane marker, which achieves an impressive result. Table 4 shows that the proposed method improves the F1 measure from 74.2 to 76.6, which indicates that the suppression of ambiguous and noisy pixels helps achieve accurate geometry and fewer false positives, in turn improving the performance of the system.
Coarse-to-fine geometry refinement. During both network training and instance decoding, we adopt a coarse-to-fine geometry refinement for a more accurate position of key points. In the training phase, the training pixels come from decoded offset predictions. In the inference phase, the predicted offset is employed to refine the position of initial key points and newly explored neighboring key points. The results of different configurations are shown in Table 4. Using coarse-to-fine only in inference improves the F1 measure by 0.9%. When coarse-to-fine is extended to training, the performance significantly exceeds that of uniform sampling, by 1.3%. The result shows that direct prediction leads to suboptimal position estimation, and that our coarse-to-fine strategy can guide the spatially most related representation to capture the geometry of the curve and achieve a more accurate prediction.
Efficiency-oriented implementation. As mentioned in Sec. 3.2, efficient decoding is aimed at real-time processing. The main difference from the greedy decoder is that the iterative decoding of neighboring key points is replaced by parallel processing. Parallel decoding significantly improves efficiency, saving 16 ms (64%) of runtime relative to the greedy decoder at the cost of a 0.8% performance degradation. The degradation can be attributed to the lack of the locally optimal estimation performed in each iteration of greedy decoding.
To maximize efficiency for application, we further replace the basic network ERFNet with BiSeNet, which is a real-time semantic segmentation network originally designed for mobile devices. Since the output of BiSeNet is downsampled 8 times from the input size, real-time performance is achieved, reaching more than 100 fps and a 77.5 F1 measure simultaneously, which is still the best result among existing methods excluding the accuracy-oriented version of our approach. The experiment also demonstrates the compatibility of the proposed system, which can be readily adapted to newer, more powerful and efficient network architectures.
To further verify the generalization of our proposed method, we use the checkpoint trained on the CULane training set to run inference on the TuSimple testing set. To our knowledge, this is the first attempt to investigate the generalization between these two widely used datasets. Table 5 shows that the proposed method is clearly superior, with an accuracy of 84.36%, which surpasses other methods by a significant margin of nearly 20%. The SCNN and PINet(4H) approaches generalize worst, with decreases of 90% and 60% respectively. The generalized visualization results on the TuSimple testing set are shown in Fig.6. This result indicates that the simplified task and the compact output of the network reduce the demand for model capacity and training data. The resulting stability and efficiency in training finally lead to advantageous generalization to other domains, which shows promising potential for application.
5 Conclusion and Future Work
In this paper, we propose a local-based bottom-up solution for lane detection. Experimental results show that the keypoint estimation and the coarse-to-fine refinement strategy circumvent the influence of ambiguous and noisy pixels, effectively improving the accuracy of curve geometry. More importantly, the principle of focusing on local geometry and the bottom-up pipeline have proved to be particularly effective: they significantly simplify the task by reducing the dimension of the CNN output, which is believed to be the principal cause of the excellent performance and generalization capacity.
The proposed method also adapts well to the rapid evolution of neural networks in performance and efficiency. We plan to incorporate more powerful architectures into the FOLOLane framework, e.g. those with a self-attention mechanism, to improve the performance further. We also want to use FOLOLane on MindSpore (https://www.mindspore.cn/), which is a new deep learning computing framework. These problems are left for future work.