To achieve fully autonomous driving, a vehicle must understand the environment around it, and various perception modules are fused for this understanding. The lane detection module is one of the main modules. It is included not only in fully autonomous driving systems but also in partially autonomous systems already deployed, such as Advanced Driver Assistance Systems (ADAS) and cruise control. Many modules are derived from the lane detection module, such as Simultaneous Localization And Mapping (SLAM) and the lane centering function in ADAS. Although various sensors can be used for the perception modules, an RGB camera is considered one of the most important sensors because of its low price. However, because an RGB image contains only pixel color information, some feature extraction method must be applied to it to infer other useful information such as lanes and object locations.
Most traditional methods for lane detection first extract low-level features of lanes using various hand-crafted features such as color and edges. These low-level features can be combined by the Hough transform and the Kalman filter, and the combined features generate lane segment information. These methods are simple and can be adapted to many environments without major modification, but their performance depends on the test conditions such as lighting and occlusion.
Deep learning methods show outstanding performance in complex scenes. Among them, Convolutional Neural Network (CNN) methods are especially used in the field of computer vision, and many recent methods apply a CNN for feature extraction [7, 8]. Semantic segmentation methods are frequently applied to the lane detection problem to infer the shape and location of lanes [9, 10]. These methods distinguish the instance and label of each pixel on the whole image. Although they achieve outstanding performance, they can only be applied to scenes with a fixed number of lanes because of their multi-class approach to distinguishing each lane. Neven et al. cast this problem as instance segmentation. Their proposed LaneNet has a shared encoder for feature extraction and two decoders: one performs binary lane segmentation, and the other is an embedding branch for instance segmentation. Because LaneNet applies instance segmentation, it can detect an arbitrary number of lanes. Chen et al. propose a network that directly predicts x-axis values for fixed y values on each lane, but the method only works on vertical lane detection.
Our proposed method predicts fewer exact points on the lanes than the input pixel count and distinguishes the points into instances. The hourglass network is usually applied to key point estimation tasks such as pose estimation and object detection [15, 16]. The hourglass network can extract information at various scales through sequences of down-sampling and up-sampling. If several hourglass networks are stacked, the loss function can be applied to each stacked network, which helps more stable training. Instance segmentation methods in computer vision generate clusters of the pixels that belong to each instance [17, 18].
Camera-based lane detection has been actively developed, and many state-of-the-art methods work almost perfectly on some public datasets. However, these methods have weaknesses such as a limit on the number of detectable lanes and high false positive rates. False negatives, the lanes that the module fails to detect, do not change the control values suddenly, and correct control values can still be calculated from the other detected lanes. However, false positives, the wrong lanes that the module reports, can cause a rapid change of the control values and lead to serious risks.
Fig. 1 shows the proposed framework for lane detection. It has three output branches and predicts the exact locations and the instance features of the points on the lanes. More details are introduced in Section II. In summary, the primary contributions of this study are: (1) We propose a novel lane detection method whose output is more compact than that of semantic segmentation based methods, which saves memory in the module. (2) The proposed post-processing method eliminates outliers successfully and increases the performance. (3) The proposed method can be applied to various scenes that include lanes of any orientation, vertical or horizontal, and an arbitrary number of lanes. (4) The evaluation results show a lower false positive ratio than other methods and state-of-the-art accuracy, which supports the stability of the autonomous driving car.
We train a neural network for lane detection. The network, which we refer to as PINet (Point Instance Network), generates points on the whole lane and distinguishes the points into instances. The loss function of this network is inspired by SGPN (Similarity Group Proposal Network), an instance segmentation framework for 3D point clouds. Unlike other instance segmentation methods in computer vision, our method needs embeddings only for the predicted points, not for all pixels. In this respect, the 3D point cloud instance segmentation method is appropriate for our task. In addition, a simple post-processing method is applied to the raw outputs of the network. This post-processing eliminates outliers among the predicted points and makes each lane smoother. Section II-A introduces the details of the main architecture and the loss function, and Section II-B introduces the proposed post-processing method.
II-A Lane Instance Point Network
PINet generates points and distinguishes them into instances. Fig. 2 shows the detailed architecture of the proposed network. The input size is 512x256, and the input is passed to a resizing layer and a feature extraction layer. The input data of size 512x256 is compressed into a smaller size by a sequence of convolution layers and max pooling layers. In this study, we experiment with two resized input sizes, 64x32 and 32x16. The feature extraction layer is inspired by the stacked hourglass network, which achieves outstanding performance on key point prediction. PINet includes two hourglass blocks for feature extraction. Each block has three output branches, and the output grid size is the same as the resized input size. Fig. 3 shows the detailed architecture of the hourglass block. In Fig. 3, blue boxes denote down-sampling bottleneck layers, green boxes denote same-scale bottleneck layers, and orange boxes denote up-sampling bottleneck layers. The details of the resizing layer and each bottleneck layer can be seen in Table I. Batch normalization and ReLU layers are applied after every convolutional layer except at the end of each output branch. The number of filters in each output branch is determined by its output values: one for the confidence branch, two for the offset branch, and four for the feature branch. The following explanations detail the role and the loss function of each output branch.
Confidence branch The confidence branch predicts a confidence value for each grid. Its output has one channel; it is passed to the next hourglass block and helps stable training. Equation 1 shows the loss function of the confidence branch.
$$ L_{conf} = \frac{\gamma_e}{N_e}\sum_{c \in G_e}(c^{*} - c)^2 + \frac{\gamma_n}{N_n}\sum_{c \in G_n}(c^{*} - c)^2 \quad (1) $$

where $N_e$ and $N_n$ denote the numbers of grids in which a point exists or does not exist, $G_e$ and $G_n$ denote the corresponding sets of grids, $c$ denotes the confidence output of a grid, $c^{*}$ denotes its ground truth, and $\gamma_e$, $\gamma_n$ denote the coefficients of each term.
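As a sanity check on this definition, the confidence loss can be sketched in a few lines of NumPy; the function name and toy values below are illustrative, not from the paper:

```python
import numpy as np

def confidence_loss(pred, gt, gamma_e=1.0, gamma_n=1.0):
    """Squared-error confidence loss, split over grids that do or do not
    contain a lane point, each term weighted by its coefficient."""
    exist = gt == 1
    non_exist = ~exist
    l_exist = gamma_e * np.mean((gt[exist] - pred[exist]) ** 2) if exist.any() else 0.0
    l_non = gamma_n * np.mean((gt[non_exist] - pred[non_exist]) ** 2) if non_exist.any() else 0.0
    return l_exist + l_non

gt = np.array([[1.0, 0.0],
               [0.0, 0.0]])
pred = np.array([[0.8, 0.1],
                 [0.0, 0.1]])
# exist term: (1 - 0.8)^2; non-exist term: mean of (0.1^2, 0^2, 0.1^2)
loss = confidence_loss(pred, gt)
```

Splitting the two sets of grids lets the coefficients balance the heavily imbalanced background against the few grids that actually contain lane points.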
Offset branch From the offset branch, we can find the exact location of each point. The outputs of the offset branch have values between 0 and 1, and each value means the position relative to its grid. In this paper, a grid is matched to 8 or 16 pixels according to the ratio between the input size and the output size. The offset branch has two channels for predicting the x-axis and y-axis offsets. Equation 2 shows the loss function of the offset branch.

$$ L_{offset} = \frac{\gamma_{xy}}{N_e}\sum_{c \in G_e}\left[(x^{*} - x)^2 + (y^{*} - y)^2\right] \quad (2) $$

where $x$, $y$ denote the predicted offsets of a grid, $x^{*}$, $y^{*}$ denote their ground truth, and $\gamma_{xy}$ denotes the coefficient of the term.
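To make the grid-to-pixel mapping concrete, the sketch below decodes confidence and offset maps into pixel coordinates, assuming the 64x32 output where one grid cell covers 8 pixels; the function name and threshold are ours:

```python
import numpy as np

GRID_PIXELS = 8  # 512/64 = 256/32 = 8 pixels per grid cell in the 64x32 setting

def decode_points(confidence, offset, threshold=0.5):
    """Convert grid confidences and offsets to pixel coordinates.

    confidence: (H, W) array; offset: (2, H, W) array with values in [0, 1],
    channel 0 = x offset and channel 1 = y offset within a cell.
    """
    ys, xs = np.where(confidence > threshold)
    points = []
    for r, c in zip(ys, xs):
        px = (c + offset[0, r, c]) * GRID_PIXELS
        py = (r + offset[1, r, c]) * GRID_PIXELS
        points.append((px, py))
    return points

conf = np.zeros((32, 64))
conf[4, 10] = 0.9
off = np.zeros((2, 32, 64))
off[0, 4, 10] = 0.5   # half a cell to the right
off[1, 4, 10] = 0.25  # a quarter cell down
# cell (row 4, col 10) decodes to x = (10 + 0.5) * 8, y = (4 + 0.25) * 8
```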
Feature branch This branch is inspired by SGPN, a 3D point cloud instance segmentation method. The feature encodes instance information, and the branch is trained to make the features of grids in the same instance closer to each other. Equations 3 and 4 show the loss function of the feature branch.

$$ L_{feature} = \frac{1}{N_e^2}\sum_{i}^{N_e}\sum_{j}^{N_e} l(i, j) \quad (3) $$

$$ l(i, j) = \begin{cases} \|F_i - F_j\|_2^2 & \text{if } C_{ij} = 1 \\ \max\left(0, K - \|F_i - F_j\|_2\right)^2 & \text{if } C_{ij} = 0 \end{cases} \quad (4) $$
where $C_{ij}$ indicates whether points $i$ and $j$ belong to the same instance, $F_i$ denotes the predicted feature of point $i$, and $K$ is a constant such that $K > 0$. If $C_{ij} = 1$, the points are in the same instance; if $C_{ij} = 0$, they lie in different lanes. During training, the loss function pulls two features closer when the points belong to the same instance and pushes them apart when they belong to different instances. We can then distinguish the points into instances with a simple distance-based clustering technique. The feature size is set to 4; this size is observed to have no major effect on the performance.
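The attract/repel behavior described above can be sketched as follows; this is a simplified illustration of the pairwise feature loss (the names are ours, and the O(N^2) double loop favors clarity over efficiency):

```python
import numpy as np

def feature_loss(features, instance_ids, K=3.0):
    """Pairwise embedding loss: pull features of the same lane together,
    push features of different lanes at least K apart.

    features: (N, D) array; instance_ids: (N,) array of lane ids.
    """
    n = len(features)
    total = 0.0
    for i in range(n):
        for j in range(n):
            d = np.linalg.norm(features[i] - features[j])
            if instance_ids[i] == instance_ids[j]:
                total += d ** 2                   # same lane: attract
            else:
                total += max(0.0, K - d) ** 2     # different lanes: repel
    return total / (n * n)
```

Once trained this way, same-lane features end up closer than K and different-lane features farther apart, which is what makes the simple distance threshold clustering work.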
The total loss is the summation of the three loss terms above, and the whole network is trained end-to-end using this total loss:

$$ L_{total} = L_{conf} + L_{offset} + L_{feature} $$
In the training step, we initially set all coefficients to 1.0 and add 0.5 to two of the coefficients during the last few epochs. The proposed loss function is applied at the end of each hourglass block, and this helps stable training of the whole network.
II-B Post processing method
The raw outputs of the network have some errors. For example, an instance should basically consist of only one smooth lane; however, some predicted instances contain outliers or points of another lane that can be distinguished visually. Fig. 4 shows this problem and the effect of the proposed post-processing method. The detailed procedure is as follows:
Step 1: Find six starting points. The starting points are defined as the three lowest points and the three leftmost or rightmost points. If the predicted lane lies to the left of the image center, the leftmost points are selected; otherwise, the rightmost points are selected.
Step 2: Select the three closest points to the current starting point among the points that are higher than it.
Step 3: Consider the straight line connecting the starting point and each point selected at step 2.
Step 4: Calculate the distance between the straight line and other points.
Step 5: Count the number of points that lie within the margin. The margin is set to 12 pixels in this paper.
Step 6: Select the point whose line has the maximum count as the new starting point, provided the count is larger than the threshold, and consider that point to belong to the same cluster as the starting point. We set the threshold to twenty percent of the remaining points.
Step 7: Repeat steps 2 to 6 until no points are found at step 2. A graphical explanation can be seen in Fig. 5.
Step 8: Repeat steps 1 to 7 for all six starting points, and take the longest cluster as the resulting lane.
Step 9: Repeat steps 1 to 8 for every predicted lane.
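A simplified sketch of steps 2 to 7 for a single starting point is shown below; the 12-pixel margin and the twenty-percent threshold follow the paper, while the six-start search and the final cluster selection (steps 1, 8, and 9) are omitted, and the names are ours:

```python
import numpy as np

MARGIN = 12  # pixels, as in the paper

def point_line_dist(p, a, b):
    """Distance from point p to the straight line through a and b."""
    a, b, p = np.asarray(a, float), np.asarray(b, float), np.asarray(p, float)
    d, ap = b - a, p - a
    n = np.linalg.norm(d)
    if n == 0.0:
        return np.linalg.norm(ap)
    return abs(d[0] * ap[1] - d[1] * ap[0]) / n

def trace_lane(points, start, min_ratio=0.2):
    """Greedily trace one lane from a starting point (steps 2 to 7).

    points: list of (x, y) pixel coordinates; start: index of the start.
    'Higher' means a smaller y value, i.e. farther up in the image.
    """
    cluster = [start]
    remaining = set(range(len(points))) - {start}
    current = start
    while True:
        # Step 2: the three closest remaining points above the current one.
        higher = [i for i in remaining if points[i][1] < points[current][1]]
        cands = sorted(higher, key=lambda i: np.hypot(
            points[i][0] - points[current][0],
            points[i][1] - points[current][1]))[:3]
        if not cands:
            return cluster  # Step 7: stop when step 2 finds no point.
        best, best_count = None, -1
        for c in cands:
            # Steps 3-5: count remaining points within MARGIN of the line.
            count = sum(
                point_line_dist(points[i], points[current], points[c]) <= MARGIN
                for i in remaining)
            if count > best_count:
                best, best_count = c, count
        # Step 6: accept only if enough remaining points support the line.
        if best_count < min_ratio * len(remaining):
            return cluster
        cluster.append(best)
        remaining.discard(best)
        current = best
```

On a toy lane of five collinear points plus one far-off outlier, the trace follows the collinear points and leaves the outlier out of the cluster.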
The network is trained on the training set of the tuSimple dataset by the Adam optimizer with an initial learning rate of 2e-4. We train the network with this initial setup during the first 1000 epochs, and we set the learning rate to 1e-4 and increase two of the loss coefficients to 1.5 during the last 200 epochs. Other hyper-parameters such as the margin size of the post-processing are determined experimentally. The optimized values of these hyper-parameters may need to be modified slightly according to the training results. No additional datasets or pre-trained weights are used, and two output sizes, 64x32 and 32x16, are evaluated. The test hardware is an NVIDIA RTX 2080 Ti.
III-A Evaluation metrics
Accuracy is the main evaluation metric of the tuSimple dataset. It is defined by the following equation and means the average ratio of correctly predicted points.
$$ Accuracy = \frac{\sum_{clip} C_{clip}}{\sum_{clip} S_{clip}} $$

where $C_{clip}$ denotes the number of points correctly predicted by the trained module on the given image clip, and $S_{clip}$ denotes the number of ground-truth points in the clip. The false negative and false positive rates are also provided by the following equations:
$$ FP = \frac{F_{pred}}{N_{pred}}, \qquad FN = \frac{M_{pred}}{N_{gt}} $$

where $F_{pred}$ denotes the number of wrongly predicted lanes, $N_{pred}$ the number of predicted lanes, $M_{pred}$ the number of missed lanes, and $N_{gt}$ the number of ground-truth lanes.
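These three metrics can be computed as follows; the function and argument names are ours, not from the benchmark code:

```python
def tusimple_metrics(correct_pts, gt_pts, wrong_lanes, pred_lanes,
                     missed_lanes, gt_lanes):
    """tuSimple accuracy, FP rate, and FN rate.

    correct_pts, gt_pts: per-clip counts of correct and ground-truth points;
    the four lane counts are totals over the whole test set.
    """
    accuracy = sum(correct_pts) / sum(gt_pts)
    fp = wrong_lanes / pred_lanes
    fn = missed_lanes / gt_lanes
    return accuracy, fp, fn
```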
Our target training and test domain is the tuSimple dataset, which consists of 3626 annotated images for training and 2782 images for testing. We apply simple data augmentation methods such as flipping, translation, rotation, adding Gaussian noise, changing intensity, and adding shadows to train a more robust model. The tuSimple dataset has different distributions of scenes according to the number of lanes shown in each scene; Table II shows more detail. The number of scenes containing five lanes is about two times larger in the test set than in the training set. To balance the two distributions, we generate data containing five lanes at a higher ratio than the others in the data augmentation step.
(Table II: number of scenes in the training and test sets, broken down by lane count.)
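As an example of the augmentations listed above, a horizontal flip must mirror the lane annotations together with the image; a minimal sketch, with names of our choosing:

```python
import numpy as np

def horizontal_flip(image, lanes):
    """Mirror an image and its lane annotations left-right.

    image: (H, W, C) array; lanes: list of (N, 2) arrays of (x, y) points.
    """
    w = image.shape[1]
    flipped = image[:, ::-1].copy()
    flipped_lanes = [np.column_stack([w - 1 - lane[:, 0], lane[:, 1]])
                     for lane in lanes]
    return flipped, flipped_lanes
```

Geometric augmentations such as translation and rotation need the same kind of coordinate transform applied to the lane points, while intensity, noise, and shadow changes leave the labels untouched.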
The evaluation of the tuSimple dataset requires exact x-axis values for certain fixed y-axis values, and we apply simple linear interpolation to find the corresponding points for the given y values. Because the predicted points are closely spaced, we can estimate accurate results without any complex curve fitting method.
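This interpolation step can be done directly with `np.interp`; a small sketch with toy lane values (the function name is ours):

```python
import numpy as np

def lane_x_at(ys_query, lane_points):
    """Linearly interpolate x values of one lane at the benchmark's fixed ys.

    lane_points: (N, 2) array of predicted (x, y) points on a single lane.
    np.interp requires increasing sample positions, so sort by y first.
    """
    pts = lane_points[np.argsort(lane_points[:, 1])]
    return np.interp(ys_query, pts[:, 1], pts[:, 0])

lane = np.array([[100.0, 200.0], [110.0, 220.0], [120.0, 240.0]])
# x at y = 210 lies halfway between the points at y = 200 and y = 220
```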
The detailed evaluation results can be seen in Table III, and Fig. 6 shows some results on the tuSimple dataset. The three configurations of our proposed method show particularly low FP rates. This means that lanes wrongly predicted by PINet are much rarer than in other methods, which supports a distinguished safety performance. Although no pre-trained weights or extra datasets are used, our proposed method also outperforms other state-of-the-art methods on the accuracy metric.
Table IV shows the number of parameters in each method; PINet is one of the lightest methods. Almost all components of PINet are built from bottleneck layers, and this architecture saves a lot of memory. The proposed method runs at about 30 frames per second without post-processing; when post-processing is applied, the whole module works at about 10 frames per second.
(Table III, excerpt: Ours (64x32 + post) achieves 96.70% accuracy, 0.0294 FP, and 0.0263 FN.)
In this study, we have proposed a novel lane detection method that combines point estimation with point instance segmentation, and it works in real time. The method achieves the lowest false positive rate and supports the safety of the autonomous driving car because wrongly predicted lanes rarely occur.
The post-processing method increases the performance of the lane detection module notably, but the current implementation requires a lot of computational cost. We expect that this problem can be solved by parallel computation or other optimization techniques.
-  He, Yinghua, Hong Wang, and Bo Zhang. ”Color-based road detection in urban traffic scenes.” IEEE Transactions on intelligent transportation systems 5.4 (2004): 309-318.
-  Chiu, Kuo-Yu, and Sheng-Fuu Lin. "Lane detection using color-based segmentation." IEEE Proceedings. Intelligent Vehicles Symposium, 2005. IEEE, 2005.
-  Wang, Yue, Dinggang Shen, and Eam Khwang Teoh. ”Lane detection using catmull-rom spline.” IEEE International Conference on Intelligent Vehicles. Vol. 1. 1998.
-  Lee, Chanho, and Ji-Hyun Moon. ”Robust lane detection and tracking for real-time applications.” IEEE Transactions on Intelligent Transportation Systems 19.12 (2018): 4043-4048.
-  Duda, R. O., and P. E. Hart. ”Use of the Hough transform to detect lines and curves in pictures.” Commun. ACM, vol. 15, no. 1, pp. 11–15, 1972.
-  Borkar, Amol, Monson Hayes, and Mark T. Smith. ”Robust lane detection and tracking with ransac and kalman filter.” 2009 16th IEEE International Conference on Image Processing (ICIP). IEEE, 2009.
-  Van Gansbeke, Wouter, et al. ”End-to-end Lane Detection through Differentiable Least-Squares Fitting.” Proceedings of the IEEE International Conference on Computer Vision Workshops. 2019.
-  Zou, Qin, et al. ”Robust lane detection from continuous driving scenes using deep neural networks.” IEEE Transactions on Vehicular Technology (2019).
-  Yang, Wei-Jong, Yoa-Teng Cheng, and Pau-Choo Chung. ”Improved Lane Detection With Multilevel Features in Branch Convolutional Neural Networks.” IEEE Access 7 (2019): 173148-173156.
-  Neven, Davy, et al. ”Towards end-to-end lane detection: an instance segmentation approach.” 2018 IEEE intelligent vehicles symposium (IV). IEEE, 2018.
-  Chen, Zhenpeng, Qianfei Liu, and Chenfan Lian. ”PointLaneNet: Efficient end-to-end CNNs for Accurate Real-Time Lane Detection.” 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019.
-  Newell, Alejandro, Kaiyu Yang, and Jia Deng. ”Stacked hourglass networks for human pose estimation.” European conference on computer vision. Springer, Cham, 2016.
-  Yang, Wei, et al. ”Learning feature pyramids for human pose estimation.” proceedings of the IEEE international conference on computer vision. 2017.
-  Duan, Kaiwen, et al. ”Centernet: Object detection with keypoint triplets.” arXiv preprint arXiv:1904.08189 (2019).
-  Zhou, Xingyi, Jiacheng Zhuo, and Philipp Krahenbuhl. "Bottom-up object detection by grouping extreme and center points." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
-  He, Kaiming, et al. ”Mask r-cnn.” Proceedings of the IEEE international conference on computer vision. 2017.
-  De Brabandere, Bert, Davy Neven, and Luc Van Gool. ”Semantic instance segmentation with a discriminative loss function.” arXiv preprint arXiv:1708.02551 (2017).
-  Wang, Weiyue, et al. ”Sgpn: Similarity group proposal network for 3d point cloud instance segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
-  The TuSimple lane challenge, http://benchmark.tusimple.ai/
-  Hou, Yuenan, et al. ”Learning lightweight lane detection cnns by self attention distillation.” Proceedings of the IEEE International Conference on Computer Vision. 2019.