Key Points Estimation and Point Instance Segmentation Approach for Lane Detection

02/16/2020 ∙ by YeongMin Ko, et al. ∙ Gwangju Institute of Science and Technology

State-of-the-art lane detection methods achieve strong performance. Despite their advantages, these methods have critical deficiencies, such as a limited number of detectable lanes and a high false positive rate. In particular, a high false positive rate can cause wrong and dangerous control. In this paper, we propose a novel deep-learning-based lane detection method that handles an arbitrary number of lanes and produces fewer false positives than other recent lane detection methods. The architecture of the proposed method has shared feature extraction layers and several branches for detection and for embedding to cluster lanes. The proposed method can generate exact points on the lanes, and we cast the clustering of the generated points as a point cloud instance segmentation problem. The proposed method is more compact because it generates fewer points than the number of pixels in the original image. Our proposed post-processing method eliminates outliers successfully and increases the performance notably. The whole proposed framework achieves competitive results on the tuSimple dataset.




I Introduction

To achieve fully autonomous driving, the car must understand the environment around it, and various perception modules are fused for this understanding. The lane detection module is one of the main modules. It is included not only in fully autonomous driving systems but also in partially autonomous systems that are already deployed, such as Advanced Driver Assistance Systems (ADAS) and cruise control systems. Many modules are derived from the lane detection module, such as Simultaneous Localization And Mapping (SLAM) and the lane centering function in ADAS. Although various sensors can be used for the perception modules, an RGB camera is considered one of the most important sensors because of its low price. However, because an RGB image contains only pixel color information, some feature extraction method must be applied to the RGB image to infer other useful information such as lanes and object locations.

Most traditional methods for lane detection first extract low-level features of lanes using various hand-crafted features such as color (e.g. [1], [2]) and edges (e.g. [3], [4]). These low-level features can be combined by the Hough transform [5] and the Kalman filter [6], and the combined features generate lane segment information. These methods are simple and can be adapted to many environments without major modification, but their performance depends on the test environment, such as the lighting conditions and occlusion.

Fig. 1: The proposed framework. Given an input image, PINet predicts three values: confidence, offset, and feature. From the confidence and offset outputs, exact points on the lanes can be predicted, and the feature output separates the predicted points into instances. Finally, the post-processing module is applied, and it generates smooth lanes.

Deep learning methods perform outstandingly in complex scenes. Among deep learning methods, Convolutional Neural Network (CNN) methods are especially common in the field of computer vision. Many recent methods apply a CNN for feature extraction [7, 8]. Semantic segmentation methods are frequently applied to the lane detection problem to infer the shape and the location of lanes [9, 10]. These methods distinguish the instance and the label of every pixel in the whole image. Although these methods achieve outstanding performance, they can only be applied to scenes that contain a fixed number of lanes because of their multi-class approach to distinguishing each lane. Neven et al. [11] cast this problem as instance segmentation. LaneNet, which they proposed, has a shared encoder for feature extraction and two decoders. One of them performs binary lane segmentation, and the other is an embedding branch for instance segmentation. Because LaneNet applies an instance segmentation method, it can detect an arbitrary number of lanes. Chen et al. [12] propose a network that directly predicts x-axis values for fixed y values on each lane, but the method only works for vertical lane detection.

Fig. 2: The detailed network training procedure. It has three main parts. Input data of size 512x256 is compressed by the resizing layer, and the compressed input is passed to the feature extraction layer. Three output branches are applied at the end of each hourglass block, and they predict the confidence, offset, and instance feature for each grid cell. The loss function can be calculated from the outputs of each hourglass block.

Our proposed method predicts exact points on the lanes, fewer in number than the pixels of the input, and assigns each point to an instance. The hourglass network [13] is commonly applied in key point estimation tasks such as pose estimation [14] and object detection [15, 16]. The hourglass network can extract information at various scales through sequences of down-sampling and up-sampling. If several hourglass networks are stacked, the loss function can be applied to each stacked network, which helps to stabilize training. Instance segmentation methods in computer vision can generate clusters of the pixels that belong to each instance [17, 18].

Camera-based lane detection has been actively developed, and many state-of-the-art methods work almost perfectly on some public data sets. However, these methods have weaknesses such as a limited number of lanes that the module can detect and a high false positive rate. False negatives, the lanes that the module fails to detect, do not change the control value suddenly, and correct control values can still be calculated from the other detected lanes. False positives, however, can lead to serious risks: the wrong lanes that the module reports can cause a rapid change of the control values.

Fig. 1 shows the proposed framework for lane detection. It has three output branches and predicts the exact location and the instance features of the points on the lanes. More details are given in Section II. In summary, the primary contributions of this study are: (1) We propose a novel method for lane detection that has a more compact output size than semantic segmentation based methods, and the compact size saves memory in the module. (2) The proposed post-processing method eliminates outliers successfully and increases the performance. (3) The proposed method can be applied to various scenes that include lanes of any orientation, such as vertical or horizontal lanes, and an arbitrary number of lanes. (4) In the evaluation, the proposed method has a lower false positive rate than the other methods together with state-of-the-art accuracy, which supports the stability of the autonomous driving car.

II Method

We train a neural network for lane detection. The network, which we refer to as PINet (Point Instance Network), generates points along the whole lane and assigns the points to instances. The loss function of this network is inspired by SGPN, the Similarity Group Proposal Network [19], which is an instance segmentation framework for 3D point clouds. Unlike other instance segmentation methods in computer vision, our proposed method needs an embedding only for the predicted points, not for all pixels. In this respect, the 3D point cloud instance segmentation method is appropriate for our task. In addition, a simple post-processing method is applied to the raw outputs of the network. This post-processing eliminates outliers among the predicted points and makes each lane smoother. Section II-A introduces the details of the main architecture and the loss function, and Section II-B introduces the proposed post-processing method.

II-A Lane Instance Point Network

Fig. 3: The hourglass block and bottleneck layer architecture. The hourglass block consists of three types of bottleneck layers: same, up-sampling, and down-sampling. Output branches are applied at the end of the hourglass layer, and the confidence output is forwarded to the next block.

PINet generates points and assigns them to instances. Fig. 2 shows the detailed architecture of the proposed network. The input size is 512x256, and the input is passed to a resizing layer and a feature extraction layer. The 512x256 input is compressed to a smaller size by a sequence of convolution layers and max pooling layers. In this study, we experiment with two resized input sizes, 64x32 and 32x16. The feature extraction layer is inspired by the stacked hourglass network, which achieves outstanding performance on key point prediction. PINet includes two hourglass blocks for feature extraction. Each block has three output branches, and the output grid size is the same as the resized input size. Fig. 3 shows the detailed architecture of the hourglass block. In Fig. 3, blue boxes denote down-sampling bottleneck layers, green boxes denote same bottleneck layers, and orange boxes denote up-sampling bottleneck layers. The details of the resizing layer and each bottleneck layer are given in Table I. Batch normalization and ReLU layers are applied after every convolutional layer except at the end of an output branch. The number of filters in each output branch is determined by its output values: 1 for the confidence branch, 2 for the offset branch, and 4 for the feature branch. The following explanations cover the role and the loss function of each output branch.
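The sizes quoted above imply how many pixels each grid cell covers and how many values the network outputs per image. The short sketch below only works through that arithmetic; the function and constant names are illustrative and do not come from the paper.

```python
# Illustrative arithmetic only: names are hypothetical, sizes come from the text.
INPUT_W, INPUT_H = 512, 256
BRANCH_CHANNELS = {"confidence": 1, "offset": 2, "feature": 4}


def grid_summary(grid_w, grid_h):
    """Return (cell width, cell height, total output values) for one grid size."""
    cell_w = INPUT_W // grid_w
    cell_h = INPUT_H // grid_h
    values = grid_w * grid_h * sum(BRANCH_CHANNELS.values())
    return cell_w, cell_h, values


for grid_w, grid_h in [(64, 32), (32, 16)]:
    cell_w, cell_h, values = grid_summary(grid_w, grid_h)
    print(f"{grid_w}x{grid_h}: cell {cell_w}x{cell_h} px, {values} output values")
```

For the 64x32 grid this gives 8x8-pixel cells and 14336 output values per hourglass block, compared with 131072 pixels in the 512x256 input.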

Layer                      Type             Filters   Size/Stride
Resizing layer             Conv             64        7/2
                           Same bottleneck  64
                           Max pooling      64        2/2
                           Same bottleneck  64
                           Max pooling      64        2/2
                           Same bottleneck  128
Same bottleneck            Conv             32        1/1
                           Conv             32        3/1
                           Conv             128       1/1
  (Residual)               Conv             128       1/1
Up-sampling bottleneck     Conv             32        1/1
                           ConvTranspose    32        3/2
                           Conv             128       1/1
  (Residual)               ConvTranspose    128       3/2
Down-sampling bottleneck   Conv             32        1/1
                           Conv             32        3/2
                           Conv             128       1/1
  (Residual)               Conv             128       3/2
TABLE I: The details of the resizing layer and each bottleneck (for the output size 64x32)

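Confidence branch. The confidence branch predicts a confidence value for each grid cell. The output of the confidence branch has 1 channel and is passed to the next hourglass block, which helps to stabilize training. Equation 1 shows the loss function of the confidence branch.

$$L_{confidence} = \gamma_e \frac{1}{N_e} \sum_{g_c \in G_e} (g_c^* - g_c)^2 + \gamma_n \frac{1}{N_n} \sum_{g_c \in G_n} (g_c^* - g_c)^2 \qquad (1)$$

where $N_e$ and $N_n$ denote the number of grid cells in which a point exists or does not exist, $G_e$ and $G_n$ denote the corresponding sets of grid cells, $g_c$ denotes the confidence output of a grid cell, $g_c^*$ denotes the ground truth, and $\gamma$ denotes each coefficient.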
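As a rough illustration of the confidence loss described above, the sketch below assumes a squared-error penalty that is averaged separately over cells with and without a lane point and then weighted by one coefficient each. The function name, the argument names, and the plain nested-list representation are assumptions made for this example.

```python
def confidence_loss(pred, target, gamma_e=1.0, gamma_n=1.0):
    """Squared-error confidence loss over a grid (sketch, not the authors' code).

    pred, target: 2-D lists of floats with the same shape; target is 1.0 in
    cells that contain a lane point and 0.0 elsewhere.
    """
    exist_err, non_exist_err = [], []
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            err = (t - p) ** 2
            # Split the errors by whether the ground truth has a point here.
            (exist_err if t > 0.5 else non_exist_err).append(err)
    loss = 0.0
    if exist_err:
        loss += gamma_e * sum(exist_err) / len(exist_err)
    if non_exist_err:
        loss += gamma_n * sum(non_exist_err) / len(non_exist_err)
    return loss


pred = [[0.5, 0.0], [1.0, 0.5]]
target = [[1.0, 0.0], [1.0, 0.0]]
print(confidence_loss(pred, target))
```

Averaging the two groups separately keeps the many empty cells from dominating the few cells that contain a point.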

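Offset branch. The offset branch gives the exact location of each point. Its outputs lie between 0 and 1 and give the position of the point relative to its grid cell. In this paper, a grid cell corresponds to 8 or 16 pixels, according to the ratio between the input size and the output size. The offset branch has two channels, which predict the x-axis offset and the y-axis offset. Equation 2 shows the loss function of the offset branch.

$$L_{offset} = \gamma_x \frac{1}{N_e} \sum_{g_x \in G_e} (g_x^* - g_x)^2 + \gamma_y \frac{1}{N_e} \sum_{g_y \in G_e} (g_y^* - g_y)^2 \qquad (2)$$

where $N_e$ and $G_e$ denote the number and the set of grid cells that contain a point, $g_x$ and $g_y$ denote the offset outputs, $g_x^*$ and $g_y^*$ denote the ground truth, and $\gamma_x$ and $\gamma_y$ denote the coefficients.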
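To make the role of the offsets concrete, the following sketch turns a confidence grid and two offset grids into pixel coordinates. The confidence threshold of 0.5 and all names are assumptions for illustration; the text does not state how cells are selected.

```python
def decode_points(confidence, offset_x, offset_y, cell_size=8, threshold=0.5):
    """Convert grid outputs to (x, y) pixel coordinates (illustrative sketch).

    confidence, offset_x, offset_y: 2-D lists indexed as [row][col].
    cell_size: pixels per grid cell (8 or 16 according to the text).
    threshold: assumed confidence cut-off, not specified in the text.
    """
    points = []
    for row, conf_row in enumerate(confidence):
        for col, conf in enumerate(conf_row):
            if conf > threshold:
                # The offset in [0, 1] places the point inside its cell.
                x = (col + offset_x[row][col]) * cell_size
                y = (row + offset_y[row][col]) * cell_size
                points.append((x, y))
    return points


confidence = [[0.1, 0.9], [0.8, 0.2]]
offset_x = [[0.0, 0.5], [0.25, 0.0]]
offset_y = [[0.0, 0.5], [0.75, 0.0]]
print(decode_points(confidence, offset_x, offset_y))
```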


Feature branch. This branch is inspired by SGPN, a 3D point cloud instance segmentation method. The feature carries information about the instance, and the branch is trained to bring the features of grid cells in the same instance closer together. Equations 3 and 4 show the loss function of the feature branch.

$$L_{feature} = \frac{1}{N_e^2} \sum_{i}^{N_e} \sum_{j}^{N_e} l(i,j) \qquad (3)$$

$$l(i,j) = \begin{cases} \lVert F_i - F_j \rVert_2 & \text{if } C_{ij} = 1 \\ \max(0, K - \lVert F_i - F_j \rVert_2) & \text{if } C_{ij} = 2 \end{cases} \qquad (4)$$

where $N_e$ is the number of grid cells that contain a point, $C_{ij}$ indicates whether point $i$ and point $j$ are in the same instance, $F_i$ denotes the feature of point $i$ predicted by the proposed network, and $K$ is a constant such that $K > 0$. If $C_{ij} = 1$, the points are in the same instance, and if $C_{ij} = 2$, the points are in different lanes. When the network is trained, the loss function pulls features closer when two points belong to the same instance and pushes them apart when two points belong to different instances. We can separate the points into instances with a simple distance-based clustering technique. The feature size is set to 4, and this size is observed to have no major effect on the performance.
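The pull/push behaviour of the feature loss and the distance-based clustering can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance, a margin `k` of 1.0, averaging over all ordered pairs, and a greedy clustering with a threshold of 0.5 are choices made here, not values given in the text.

```python
import math


def feature_distance(f_i, f_j):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))


def feature_loss(features, instance_ids, k=1.0):
    """Pairwise loss: pull same-instance features together, push others apart."""
    n = len(features)
    if n == 0:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            d = feature_distance(features[i], features[j])
            if instance_ids[i] == instance_ids[j]:
                total += d                  # same lane: penalize distance
            else:
                total += max(0.0, k - d)    # different lane: penalize closeness
    return total / (n * n)


def cluster_by_distance(features, threshold=0.5):
    """Greedy clustering: join the first cluster whose seed is within threshold."""
    seeds, labels = [], []
    for f in features:
        for idx, seed in enumerate(seeds):
            if feature_distance(f, seed) < threshold:
                labels.append(idx)
                break
        else:
            seeds.append(f)
            labels.append(len(seeds) - 1)
    return labels


features = [(0.0, 0.0, 0.0, 0.0), (0.1, 0.0, 0.0, 0.0), (2.0, 0.0, 0.0, 0.0)]
print(cluster_by_distance(features))
print(feature_loss(features, [0, 0, 1]))
```

Because well-trained features of different lanes end up at least `k` apart, a single distance threshold below `k` is enough to separate them in this sketch.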

The total loss is the sum of the above three loss terms, and the whole network is trained end to end using the following total loss.

$$L_{total} = a L_{confidence} + b L_{offset} + c L_{feature}$$

In the training step, we initially set all coefficients to 1.0 and add 0.5 to two of them during the last few epochs. The proposed loss function is applied at the end of each hourglass block, which helps to stabilize the training of the whole network.

II-B Post processing method

The raw outputs of the network contain some errors. For example, an instance should consist of only one smooth lane. However, the raw output can contain outliers, or a single instance can contain another lane that is visually distinguishable. Fig. 4 shows this problem and the effect of the proposed post-processing method. The detailed procedure is as follows:

  • Step 1: Find six starting points. Starting points are defined as the three lowest points and the three leftmost or rightmost points. If the predicted lane is on the left relative to the center of the image, the leftmost points are selected.

  • Step 2: For each starting point, select the three closest points among the points that are higher than the starting point.

  • Step 3: Consider the straight line connecting the starting point from step 1 and each point selected at step 2.

  • Step 4: Calculate the distance between the straight line and the other points.

  • Step 5: Count the number of points that are within the margin. The margin is set to 12 in this paper.

  • Step 6: Select the point whose count is the maximum and is larger than the threshold as the new starting point, and consider that point to belong to the same cluster as the starting point. We set the threshold to twenty percent of the remaining points.

  • Step 7: Repeat step 2 to step 6 until no points are found at step 2. A graphical explanation can be seen in Fig. 5.

  • Step 8: Repeat step 1 to step 7 for all starting points, and take the longest cluster as the resulting lane.

  • Step 9: Repeat step 1 to step 8 for all predicted lanes.
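The steps above can be sketched in Python as below. This is a simplified reading of the procedure rather than the authors' implementation, and several details are assumptions: image coordinates with y growing downward, the margin interpreted in pixels, "remaining points" taken to be the points above the current starting point, the candidate itself included in the count, and the lane side decided by the mean x coordinate. All function names are hypothetical.

```python
import math

MARGIN = 12.0       # margin from the text (assumed to be in pixels)
COUNT_RATIO = 0.2   # threshold: twenty percent of the remaining points


def line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length


def starting_points(points, image_width):
    """Step 1: three lowest points plus three leftmost or rightmost points."""
    lowest = sorted(points, key=lambda p: -p[1])[:3]
    mean_x = sum(p[0] for p in points) / len(points)
    on_left = mean_x < image_width / 2
    sideways = sorted(points, key=lambda p: p[0], reverse=not on_left)[:3]
    return list(dict.fromkeys(lowest + sideways))  # drop duplicates, keep order


def grow_cluster(start, points):
    """Steps 2 to 7: follow the lane upward from one starting point."""
    cluster, current = [start], start
    while True:
        above = [p for p in points if p[1] < current[1]]  # higher in the image
        if not above:
            break
        candidates = sorted(
            above, key=lambda p: math.hypot(p[0] - current[0], p[1] - current[1])
        )[:3]
        best, best_count = None, 0
        for cand in candidates:
            # Count points near the line current -> cand (cand counts itself).
            count = sum(
                1 for p in above if line_distance(p, current, cand) <= MARGIN
            )
            if count > best_count:
                best, best_count = cand, count
        if best is None or best_count <= COUNT_RATIO * len(above):
            break
        cluster.append(best)
        current = best
    return cluster


def refine_lane(points, image_width=512):
    """Step 8: keep the longest cluster over all starting points."""
    if not points:
        return []
    clusters = [grow_cluster(s, points) for s in starting_points(points, image_width)]
    return max(clusters, key=len)


# Step 9 would apply refine_lane to every predicted lane.
inliers = [(100 + (250 - y) * 0.5, y) for y in range(250, 40, -20)]
outlier = (400.0, 150.0)
print(refine_lane(inliers + [outlier]))
```

In this toy example the eleven collinear points are kept and the isolated point far from the line is dropped.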

Fig. 4: The result of the post processing. (a) is the input image, and (b) is the raw output of PINet. In (b), the blue lane contains some outliers, and another lane can be distinguished within it. In (c), the result of the proposed post-processing method, the outliers are eliminated, and only the smooth, longest lanes remain.
Fig. 5: An explanation of the post processing. There are no other points within the margin around the straight line connecting points S and A, but the margin around the line connecting points S and B contains 2 other points. As a result, point B is selected.

III Results

The network is trained on the training set of the tuSimple dataset [20] with the Adam optimizer and an initial learning rate of 2e-4. We train the network with this initial setup for the first 1000 epochs, and for the last 200 epochs we set the learning rate to 1e-4 and two of the loss coefficients to 1.5. Other hyper-parameters, such as the margin size of the post processing, are determined experimentally. The optimized values of these hyper-parameters need to be adjusted slightly according to the training results. No additional dataset or pre-trained weights are used, and two output sizes, 64x32 and 32x16, are evaluated. The test hardware is an NVIDIA RTX 2080 Ti.

III-A Evaluation metrics

Accuracy is the main evaluation metric of the tuSimple dataset. It is defined by the following equation and represents the average ratio of correctly predicted points.

$$accuracy = \frac{\sum_{clip} C_{clip}}{\sum_{clip} S_{clip}}$$

where $C_{clip}$ denotes the number of points correctly predicted by the trained module on the given image clip, and $S_{clip}$ denotes the number of ground-truth points in the clip. The false positive and false negative rates are given by the following equations.

$$FP = \frac{F_{pred}}{N_{pred}}, \qquad FN = \frac{M_{pred}}{N_{gt}}$$

where $F_{pred}$ denotes the number of wrongly predicted lanes, $N_{pred}$ denotes the number of predicted lanes, $M_{pred}$ denotes the number of missed lanes, and $N_{gt}$ denotes the number of ground-truth lanes.
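A small sketch of how these three ratios could be computed from per-clip counts is shown below. It follows the definitions given in the text by pooling the counts over all clips; the dictionary keys and the function name are hypothetical, and this is not the official tuSimple evaluation script.

```python
def tusimple_metrics(clips):
    """Compute (accuracy, FP, FN) from per-clip counts (illustrative sketch).

    clips: list of dicts with the hypothetical keys correct_points, gt_points,
    wrong_lanes, pred_lanes, missed_lanes and gt_lanes.
    """
    accuracy = sum(c["correct_points"] for c in clips) / sum(
        c["gt_points"] for c in clips
    )
    fp = sum(c["wrong_lanes"] for c in clips) / sum(c["pred_lanes"] for c in clips)
    fn = sum(c["missed_lanes"] for c in clips) / sum(c["gt_lanes"] for c in clips)
    return accuracy, fp, fn


clips = [
    {"correct_points": 90, "gt_points": 100, "wrong_lanes": 1,
     "pred_lanes": 4, "missed_lanes": 0, "gt_lanes": 4},
    {"correct_points": 45, "gt_points": 50, "wrong_lanes": 0,
     "pred_lanes": 4, "missed_lanes": 1, "gt_lanes": 4},
]
print(tusimple_metrics(clips))
```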

Fig. 6: Results on the tuSimple dataset. The first row shows the ground-truth data, the second row shows the raw outputs of the proposed network, and the third row shows the final outputs after post processing.

III-B Experiments

Our target training and test domain is the tuSimple dataset. The tuSimple dataset consists of 3626 annotated images for training and 2782 images for testing. We apply simple data augmentation methods such as flipping, translation, rotation, adding Gaussian noise, changing intensity, and adding shadows to train a more robust model. The tuSimple dataset has a different distribution of scenes according to the number of lanes shown in the scene, and Table II gives the details. The number of scenes with five lanes in the test set is about twice as large as in the training set. To balance the two distributions, we generate data with five lanes at a higher ratio than the other data in the data augmentation step.

Number of lanes   Training set   Test set
Five              239            569
Four              2982           468
Three             404            1740
Two               1              5
Total             3626           2782
TABLE II: Distribution of the scenes according to the number of lanes in the training and test sets
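The imbalance in Table II is easier to see as shares of each set. The sketch below only recomputes those shares from the table; the names are illustrative.

```python
# Counts copied from Table II: (training set, test set) per number of lanes.
TABLE_II = {
    "five": (239, 569),
    "four": (2982, 468),
    "three": (404, 1740),
    "two": (1, 5),
}


def lane_count_shares(table):
    """Return {lanes: (training share, test share)} as fractions of each set."""
    train_total = sum(train for train, _ in table.values())
    test_total = sum(test for _, test in table.values())
    return {
        lanes: (train / train_total, test / test_total)
        for lanes, (train, test) in table.items()
    }


for lanes, (train_share, test_share) in lane_count_shares(TABLE_II).items():
    print(f"{lanes}: train {train_share:.1%}, test {test_share:.1%}")
```

Five-lane scenes make up about 6.6% of the training set but about 20.5% of the test set, which is the gap the augmentation step is meant to reduce.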

The evaluation on the tuSimple dataset requires exact x-axis values for some fixed y-axis values, and we apply simple linear interpolation to find the corresponding points for the given y values. Because the predicted points are close to each other, we can estimate accurate results without any complex curve fitting method.
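The interpolation step can be sketched as follows, assuming each lane is a list of (x, y) points and that y values outside the predicted range are simply reported as missing. The function name and the use of `None` for missing values are choices made for this example.

```python
def interpolate_x(points, y_samples):
    """Linearly interpolate x at each sample y along one lane (sketch).

    points: list of (x, y) tuples for a single lane.
    Returns one x per sample y, or None when y lies outside the lane's range.
    """
    pts = sorted(points, key=lambda p: p[1])
    result = []
    for y in y_samples:
        x_at_y = None
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if y0 <= y <= y1:
                t = 0.0 if y1 == y0 else (y - y0) / (y1 - y0)
                x_at_y = x0 + t * (x1 - x0)
                break
        result.append(x_at_y)
    return result


lane = [(100, 200), (120, 160), (140, 120)]
print(interpolate_x(lane, [180, 160, 100]))
```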

The detailed evaluation results are given in Table III, and Fig. 6 shows some results on the tuSimple dataset. All three configurations of our proposed method show a particularly low FP rate. This means that lanes wrongly predicted by PINet are much rarer than with the other methods, which supports a high level of safety. Although no pre-trained weights or extra datasets are used, our proposed method with post processing also outperforms the other state-of-the-art methods on the accuracy metric.

Table IV shows the number of parameters in each method, and it shows that PINet is one of the lightest of the compared methods. Almost all components of PINet are built from bottleneck layers, and this architecture saves a lot of memory. The proposed method runs at about 30 frames per second without the post processing, and with the post processing applied, the whole module runs at about 10 frames per second.

Method                          Accuracy   FP       FN
SCNN [10]                       96.53%     0.0617   0.0180
LaneNet(+H-net) [11]            96.38%     0.0780   0.0244
PointLaneNet(MobileNet) [12]    96.34%     0.0467   0.0518
ENet-SAD [21]                   96.64%     0.0602   0.0205
Ours(32x16)                     95.75%     0.0266   0.0362
Ours(64x32)                     96.62%     0.0308   0.0272
Ours(64x32 + post)              96.70%     0.0294   0.0263
TABLE III: Evaluation results on the tuSimple dataset

Model                           Parameters (M)
SCNN                            20.72
LaneNet(+H-net)                 15.98
PointLaneNet(MobileNet)         5.33
ENet-SAD                        0.98
R-18-SAD                        12.41
Ours(32x16)                     4.40
Ours(64x32)                     4.39
TABLE IV: Model analysis

IV Conclusions

In this study, we have proposed a novel lane detection method that combines point estimation with point instance segmentation, and it can run in real time. The method achieves the lowest false positive rate among the compared methods, which supports the safety of the autonomous driving car because wrongly predicted lanes rarely occur.

The post-processing method notably increases the performance of the lane detection module, but the current implementation has a high computing cost. We expect that this problem can be solved by parallel computation or other optimization techniques.


  • [1] He, Yinghua, Hong Wang, and Bo Zhang. "Color-based road detection in urban traffic scenes." IEEE Transactions on Intelligent Transportation Systems 5.4 (2004): 309-318.
  • [2] Chiu, Kuo-Yu, and Sheng-Fuu Lin. "Lane detection using color-based segmentation." IEEE Proceedings. Intelligent Vehicles Symposium, 2005. IEEE, 2005.
  • [3] Wang, Yue, Dinggang Shen, and Eam Khwang Teoh. "Lane detection using catmull-rom spline." IEEE International Conference on Intelligent Vehicles. Vol. 1. 1998.
  • [4] Lee, Chanho, and Ji-Hyun Moon. "Robust lane detection and tracking for real-time applications." IEEE Transactions on Intelligent Transportation Systems 19.12 (2018): 4043-4048.
  • [5] Duda, R. O., and P. E. Hart. "Use of the Hough transform to detect lines and curves in pictures." Commun. ACM, vol. 15, no. 1, pp. 11–15, 1972.
  • [6] Borkar, Amol, Monson Hayes, and Mark T. Smith. "Robust lane detection and tracking with ransac and kalman filter." 2009 16th IEEE International Conference on Image Processing (ICIP). IEEE, 2009.
  • [7] Van Gansbeke, Wouter, et al. "End-to-end Lane Detection through Differentiable Least-Squares Fitting." Proceedings of the IEEE International Conference on Computer Vision Workshops. 2019.
  • [8] Zou, Qin, et al. "Robust lane detection from continuous driving scenes using deep neural networks." IEEE Transactions on Vehicular Technology (2019).
  • [9] Yang, Wei-Jong, Yoa-Teng Cheng, and Pau-Choo Chung. "Improved Lane Detection With Multilevel Features in Branch Convolutional Neural Networks." IEEE Access 7 (2019): 173148-173156.
  • [10] Pan, Xingang, et al. "Spatial as deep: Spatial cnn for traffic scene understanding." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
  • [11] Neven, Davy, et al. "Towards end-to-end lane detection: an instance segmentation approach." 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018.
  • [12] Chen, Zhenpeng, Qianfei Liu, and Chenfan Lian. "PointLaneNet: Efficient end-to-end CNNs for Accurate Real-Time Lane Detection." 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019.
  • [13] Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." European Conference on Computer Vision. Springer, Cham, 2016.
  • [14] Yang, Wei, et al. "Learning feature pyramids for human pose estimation." Proceedings of the IEEE International Conference on Computer Vision. 2017.
  • [15] Duan, Kaiwen, et al. "Centernet: Object detection with keypoint triplets." arXiv preprint arXiv:1904.08189 (2019).
  • [16] Zhou, Xingyi, Jiacheng Zhuo, and Philipp Krahenbuhl. "Bottom-up object detection by grouping extreme and center points." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
  • [17] He, Kaiming, et al. "Mask r-cnn." Proceedings of the IEEE International Conference on Computer Vision. 2017.
  • [18] De Brabandere, Bert, Davy Neven, and Luc Van Gool. "Semantic instance segmentation with a discriminative loss function." arXiv preprint arXiv:1708.02551 (2017).
  • [19] Wang, Weiyue, et al. "Sgpn: Similarity group proposal network for 3d point cloud instance segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
  • [20] The TuSimple lane challenge,
  • [21] Hou, Yuenan, et al. "Learning lightweight lane detection cnns by self attention distillation." Proceedings of the IEEE International Conference on Computer Vision. 2019.