Autonomous driving is a challenging field of research that has received a lot of attention in recent years. The perceptual problems related to this field have been immensely impacted by the advances in deep learning [2, 3, 4]. In particular, autonomous vehicles should be capable of estimating traffic lanes because, besides working as a spatial limit, each lane provides specific visual cues ruling the travel. In this context, the two most important traffic lines (i.e., lane markings) are those defining the lane of the vehicle, i.e., the ego-lane. These lines set the limits for the driver’s actions, and their types define whether or not maneuvers (e.g., lane changes) are allowed. It might also be useful to detect the adjacent lanes so that the system’s decisions can be based on a better understanding of the traffic scene.
Lane estimation (or detection) may seem trivial at first, but it can be very challenging. Although fairly standardized, lane markings vary in shape and color. Estimating a lane when dashed or partially occluded lane markings are present requires a semantic understanding of the scene. Moreover, the environment itself is inherently diverse: there may be heavy traffic, people passing by, or just an empty highway. In addition, these environments are subject to several weather (e.g., rain, snow, sun) and illumination (e.g., day, night, dawn, tunnels) conditions, which may change while driving.
The traditional approach to this task is to extract hand-crafted features, followed by a curve-fitting process. Although this approach tends to work well under normal and limited circumstances, it is usually not as robust as needed in adverse conditions (such as the aforementioned ones). Therefore, following the trend in many computer vision problems, deep learning has recently started to be used to learn robust features and improve the lane marking estimation process [7, 8, 9]. Once the lane markings are estimated, further processing can be performed to determine the actual lanes. Still, there are limitations to be tackled. First, many of these deep learning-based models treat lane marking estimation as a two-step process: feature extraction and curve fitting. Most works extract features via segmentation-based models, which are usually inefficient and struggle to run in real time, as required for autonomous driving. Additionally, the segmentation step alone is not enough to provide a lane marking estimate, since the segmentation maps have to be post-processed in order to output traffic lines. Further, these two-step processes might ignore global information, which is especially important when visual cues are missing (as in strong shadows and occlusions). Second, some of these works are performed by private companies that often (i) do not provide means to replicate their results and (ii) develop their methods on private datasets, which hinders research progress. Lastly, there is room for improvement in the evaluation protocol. The methods are usually tested on datasets from the USA only (roads in developing countries are usually not as well maintained) and the evaluation metrics are too permissive (they tolerate errors to an extent that hinders proper comparisons), as discussed in Section IV.
In this context, methods that remove the need for a two-step process, and thereby further reduce the processing cost, could benefit advanced driver assistance systems (ADAS), which often rely on low-energy, embedded hardware. In addition, a method that has been tested on roads other than those of the USA is also of benefit to the broader community. Moreover, less permissive metrics would make it possible to better differentiate methods and would provide a clearer overview of the methods and their usefulness.
This work proposes PolyLaneNet, a convolutional neural network (CNN) for end-to-end lane markings estimation. PolyLaneNet takes as input images from a forward-looking camera mounted in the vehicle and outputs polynomials that represent each lane marking in the image, along with the domains for these polynomials and confidence scores for each lane. This approach is shown to be competitive with existing state-of-the-art methods, while being faster and not requiring post-processing to have the lane estimates. In addition, we provide a deeper analysis using metrics suggested by the literature. Finally, we publicly released the source-code (for both training and inference) and the trained models, allowing the replication of all the results presented in this paper.
II Related Works
Lane Detection. Before the rise of deep learning, methods for lane detection were mostly model- or learning-based, i.e., they exploited hand-crafted and specialized features. Shape and color were the most commonly used features [10, 11], and lanes were normally represented by both straight and curved lines [12, 13]. These methods, however, were not robust to sudden illumination changes, weather conditions, differences in appearance between cameras, and many other factors found in driving scenes. The interested reader is referred to the survey by McCall and Trivedi for a more complete overview of earlier lane detection methods.
With the success of deep learning, researchers have also investigated its use to tackle lane detection. Huval et al. were among the first to use deep learning for lane detection. Their model is based on OverFeat and outputs a kind of segmentation map that is later post-processed using DBSCAN clustering. They collected a private dataset in San Francisco (USA) that was used to train and evaluate their system. Because of the success of their application, companies also became interested in investigating this problem. Later, Ford released DeepLanes, which, unlike most of the literature, detects lanes based on laterally-mounted cameras. Despite the good results, the way they modeled the problem made it less widely applicable, and they also used a private US-based dataset.
Another notable work is SCNN, a method proposed for traffic scene understanding that exploits the propagation of spatial information via a specially designed CNN structure. Their model outputs a probability map for the lanes that is post-processed in order to provide the lane estimates. To evaluate their system, they used an evaluation metric based on the IoU between the prediction and the ground truth. After that, Li et al. proposed Line-CNN, a model in which the key component is the line proposal unit (LPU), adapted from the region proposal network (RPN) of Faster R-CNN. They also submitted their results to the TuSimple benchmark (after the challenge was finished), with marginally better results compared to SCNN. Their main experiments, though, were with a much larger dataset that was not publicly released. In addition to this private dataset, the source code is proprietary and the authors will not release it. Another approach is FastDraw, in which the common post-processing of segmentation-based methods is replaced by “drawing” the lanes according to the likelihood of polylines, which is maximized at training time. In addition to evaluating on the TuSimple and CULane datasets, the authors provide qualitative results on yet another private US-based dataset. Moreover, they did not release their implementation, which hinders further comparisons. Some of the segmentation-based methods focus on improving the inference speed, as in ENet-SAD, which focuses on learning lightweight CNNs by exploiting self attention distillation. The authors evaluated their method on three well-known datasets. Although the source code was publicly released, some of the results are not reproducible (according to the author of ENet-SAD, the difference in performance comes from engineering tricks neither described in the paper nor included in the available code: https://github.com/cardwing/Codes-for-Lane-Detection/issues/208).
In summary, one of the main problems with existing state-of-the-art methods is reproducibility, since most do not publish either the datasets used or the source code. In this work, we present results that are competitive with state-of-the-art methods on public datasets and fully reproducible, since we provide the source code and use only publicly available datasets (including one from outside the US).
Model Definition. PolyLaneNet expects as input images taken from a forward-looking vehicle camera, and outputs, for each image, $M_{max}$ lane marking candidates (represented as polynomials), as well as the vertical position $h$ of the horizon line, which helps to define the upper limit of the lane markings. The architecture of PolyLaneNet consists of a backbone network (for feature extraction) appended with a fully connected layer with $M_{max} + 1$ outputs, the outputs $1, \dots, M_{max}$ being for lane marking prediction and the output $M_{max} + 1$ for $h$. PolyLaneNet adopts a polynomial representation for the lane markings instead of a set of points. Therefore, for each output $j$, $j = 1, \dots, M_{max}$, the model estimates the coefficients $P_j = \{a_{k,j}\}_{k=0}^{K}$ representing the polynomial
$$p_j(y) = \sum_{k=0}^{K} a_{k,j}\, y^k, \tag{1}$$
where $K$ is a parameter that defines the order of the polynomial. As illustrated in Figure 1, the polynomials have a restricted domain: the height of the image. Besides the coefficients, the model estimates, for each lane marking $j$, the vertical offset $s_j$ and the prediction confidence score $c_j \in [0, 1]$. In summary, the PolyLaneNet model can be expressed as
$$f(I; \theta) = \left(\{(P_j, s_j, c_j)\}_{j=1}^{M_{max}},\, h\right), \tag{2}$$
where $I$ is the input image and $\theta$ denotes the model parameters. In a system in operation, as illustrated in Figure 1, only the lane marking candidates whose confidence score is greater than or equal to a threshold are considered as detected.
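As a rough illustration of how outputs of this form could be consumed, the following Python sketch evaluates each polynomial along a few image rows and keeps only the candidates whose confidence reaches a threshold. The names `LaneCandidate`, `eval_poly`, and `decode_lanes` are hypothetical, the 0.5 default threshold is a placeholder, and sampling rows between the vertical offset and the horizon is an assumption of this example rather than the released implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LaneCandidate:
    coeffs: List[float]  # polynomial coefficients, lowest order first (assumed layout)
    offset: float        # vertical offset of the lane marking
    confidence: float    # confidence score in [0, 1]


def eval_poly(coeffs: List[float], y: float) -> float:
    """Evaluate x = sum_k a_k * y**k using Horner's rule."""
    x = 0.0
    for a in reversed(coeffs):
        x = x * y + a
    return x


def decode_lanes(candidates: List[LaneCandidate], horizon: float,
                 conf_threshold: float = 0.5,
                 num_samples: int = 5) -> List[List[Tuple[float, float]]]:
    """Return (x, y) samples for every candidate that passes the threshold."""
    lanes = []
    for cand in candidates:
        if cand.confidence < conf_threshold:
            continue  # candidate is not considered detected
        # Assumption: the marking spans the rows between its offset and the horizon.
        y_lo, y_hi = sorted((cand.offset, horizon))
        step = (y_hi - y_lo) / (num_samples - 1)
        ys = [y_lo + i * step for i in range(num_samples)]
        lanes.append([(eval_poly(cand.coeffs, y), y) for y in ys])
    return lanes
```

Representing each marking by a handful of coefficients keeps the decoding step to a polynomial evaluation, which is why no segmentation-style post-processing appears in the sketch.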
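Model Training. For an input image, let $M$ be the number of annotated lane markings. In general, traffic scenes contain few lanes, so $M$ is small for most images in the available datasets. For training (and metric evaluation), each annotated lane marking $j$, $j = 1, \dots, M$, is associated with the neuron unit $j$ of the output. Therefore, predictions related to the outputs $M + 1, \dots, M_{max}$ (with $M_{max}$ the number of lane marking outputs of the model) should be disregarded in the loss function. An annotated lane marking $j$ is represented by a set of points $L^*_j = \{(x^*_{i,j}, y^*_{i,j})\}_{i=1}^{N}$, where $y^*_{i+1,j} > y^*_{i,j}$ for every $i = 1, \dots, N - 1$. As a rule of thumb, the higher $N$ is, the richer the structures that can be captured. We assume that the lane markings $\{L^*_j\}_{j=1}^{M}$ are ordered according to the $x$-coordinate of the point closest to the bottom of the image, i.e., if $j_1 < j_2$, then the bottom-most point of lane marking $j_1$ has a smaller $x$-coordinate than that of lane marking $j_2$. For each lane marking $j$, the vertical offset $s^*_j$ was set as $\min_i y^*_{i,j}$; the confidence score is defined as
$$c^*_j = \begin{cases} 1, & \text{if } j \le M, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
The model is trained using the multi-task loss function defined as (for a single image)
$$L(\{P_j\}, h, \{s_j\}, \{c_j\}) = W_p L_p(\{P_j\}, \{L^*_j\}) + W_s \frac{1}{M} \sum_j L_{reg}(s_j, s^*_j) + W_c \frac{1}{M} \sum_j L_{cls}(c_j, c^*_j) + W_h L_{reg}(h, h^*), \tag{4}$$
where $P_j$, $s_j$, $c_j$, and $h$ are the predicted polynomial coefficients, vertical offsets, confidence scores, and horizon position, starred symbols denote the corresponding ground-truth values, and $W_p$, $W_s$, $W_c$, and $W_h$ are constant weights used for balancing. The functions $L_{reg}$ and $L_{cls}$ are the Mean Squared Error (MSE) and Binary Cross Entropy (BCE) functions, respectively. The $L_p$ loss function measures how well the polynomial $p_j$ (Equation 1) fits the annotated points. Consider the annotated $x$-coordinates $\mathbf{x}^*_j = [x^*_{1,j}, \dots, x^*_{N,j}]$ and $\mathbf{x}_j = [x_{1,j}, \dots, x_{N,j}]$, where
$$x_{i,j} = \begin{cases} p_j(y^*_{i,j}), & \text{if } |p_j(y^*_{i,j}) - x^*_{i,j}| > \tau_{loss}, \\ 0, & \text{otherwise,} \end{cases} \tag{5}$$
where $\tau_{loss}$ is an empirically defined threshold that tries to reduce the focus of the loss on points that are already well aligned. Such an effect appears because the lane markings comprise several points with different sampling densities (i.e., points closer to the camera are denser than points further away). Finally, $L_p$ is defined as
$$L_p(\{P_j\}, \{L^*_j\}) = L_{reg}(\mathbf{x}_j, \mathbf{x}^*_j). \tag{6}$$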
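To make the structure of the multi-task loss more concrete, the sketch below computes it for a single image in plain Python. It is only one possible reading of the description above: it assumes that annotated points already within the threshold of Equation 5 contribute nothing to the polynomial term, that the polynomial term is averaged over the annotated lane markings, and that outputs without a matching annotation receive only the confidence loss. The function names, the unit weights, and the data layout are hypothetical, and at least one annotated lane marking is assumed.

```python
import math


def poly_x(coeffs, y):
    """Evaluate x = sum_k a_k * y**k (coefficients listed from a_0 upward)."""
    return sum(a * y ** k for k, a in enumerate(coeffs))


def bce(p, target, eps=1e-7):
    """Binary cross entropy for one probability, clipped for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def poly_loss(coeffs, points, tau):
    """Thresholded MSE: points within tau pixels add no error (assumed reading)."""
    total = 0.0
    for x_gt, y_gt in points:
        err = poly_x(coeffs, y_gt) - x_gt
        if abs(err) > tau:
            total += err ** 2
    return total / len(points)


def multitask_loss(preds, horizon, lanes_gt, horizon_gt,
                   w_p=1.0, w_s=1.0, w_c=1.0, w_h=1.0, tau=20.0):
    """preds: one (coeffs, offset, confidence) tuple per output unit.
    lanes_gt: annotated lane markings, each a list of (x, y) points."""
    m = len(lanes_gt)  # number of annotated lane markings (must be >= 1)
    l_p = l_s = l_c = 0.0
    for j, (coeffs, offset, conf) in enumerate(preds):
        annotated = j < m
        l_c += bce(conf, 1.0 if annotated else 0.0)
        if annotated:  # unmatched outputs only receive the confidence loss
            l_p += poly_loss(coeffs, lanes_gt[j], tau)
            offset_gt = min(y for _, y in lanes_gt[j])
            l_s += (offset - offset_gt) ** 2
    l_h = (horizon - horizon_gt) ** 2
    return (w_p * l_p + w_s * l_s + w_c * l_c) / m + w_h * l_h
```

With this reading, shifting a prediction by less than `tau` pixels leaves the polynomial term unchanged, which matches the stated goal of reducing the focus on points that are already well aligned.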
IV Experimental Methodology
PolyLaneNet was evaluated on publicly available datasets, which are introduced in this section. The section then describes the implementation details, the metrics, and the experiments performed.
Three datasets were used to evaluate PolyLaneNet: TuSimple, LLAMAS, and ELAS. For quantitative results, the widely used TuSimple was employed. The dataset has a total of 6,408 annotated images with a resolution of 1280×720 pixels, and it is originally split into 3,268 images for training, 358 for validation, and 2,782 for testing. For qualitative results, two other datasets were used: LLAMAS and ELAS. The first is a large dataset, split into 58,269 images for training, 20,844 for validation, and 20,929 for testing, with a resolution of 1280×717 pixels. Both TuSimple and LLAMAS are datasets from the USA. Since neither the benchmark nor the test set annotations for LLAMAS are available yet, only qualitative results are presented. ELAS is a dataset with 16,993 images from various cities in Brazil, with a resolution of 640×480 pixels. Since the dataset was originally proposed for a non-learning-based method, it does not provide training/testing splits. Thus, we created those splits by separating 11,036 images for training and 5,957 for testing. The main difference between ELAS and the other two datasets is that in ELAS only the ego-lane is annotated. Nonetheless, the dataset also provides other types of useful information for the lane detection task, such as lane types (e.g., solid or dashed, white or yellow), but they are not used in this paper.
IV-B Implementation details
The hyperparameters for every experiment in this work were the same, with the exception of the ablation study, where in each training one parameter was modified. For the backbone network, the EfficientNet-b0 was used. For the TuSimple training, data augmentation was applied with a probability of . The transformations used were: rotation with an angle in degrees , horizontal flip with a probability of 0.5, and a random crop with size 1152×648 pixels. After the data augmentation, the following transformations were applied: a resize to 640×360 pixels and then a normalization with ImageNet’s mean and standard deviation. The Adam optimizer was used, along with the Cosine Annealing learning rate scheduler with an initial learning rate of 3e-4 and a period of 770 epochs. The training session ran for 2695 epochs, taking approximately 35 hours on a Titan V, with a batch size of 16 images, from a model pretrained on ImageNet. A third-order polynomial was chosen as the default. For the loss function, the parameters and were used. The threshold (Equation 5) was set to 20 pixels. In the testing phase, lane markings predicted with a confidence score were ignored. For more details, the source code and trained models are publicly available (https://github.com/lucastabelini/PolyLaneNet).
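The learning rate schedule described above can be sketched with a closed-form cosine curve. This is a minimal sketch under two assumptions that the text does not spell out: that "period" refers to one full cosine cycle of 770 epochs, and that the minimum learning rate is zero. The function name `cosine_lr` is hypothetical and does not correspond to a specific library call.

```python
import math


def cosine_lr(epoch, base_lr=3e-4, period=770, min_lr=0.0):
    """Learning rate at `epoch` for a cosine cycle lasting `period` epochs."""
    phase = 2.0 * math.pi * (epoch % period) / period
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(phase))


# Inspect the schedule at the start, at half a period, after a full period,
# and at the end of the 2695-epoch training session.
for epoch in (0, 385, 770, 2695):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

Under these assumptions, 2695 epochs correspond to 3.5 periods, so training would stop where the learning rate is at its minimum, which is consistent with the later remark that half a period (385 epochs) ends at a minimum.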
IV-C Evaluation Metrics
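The metrics used to measure the proposed method’s performance come from TuSimple’s benchmark. The three metrics are: accuracy (Acc), false positive (FP) rate, and false negative (FN) rate. For a predicted lane marking to be considered a true positive (i.e., a correct one), its accuracy, defined as
$$\mathrm{Acc}(P_j, L^*_j) = \frac{1}{|L^*_j|} \sum_{(x^*_{i,j},\, y^*_{i,j}) \in L^*_j} \mathbb{1}\left[\, |p_j(y^*_{i,j}) - x^*_{i,j}| < \tau_{acc} \,\right], \tag{7}$$
where $p_j$ is the predicted polynomial with coefficients $P_j$ and $L^*_j$ is the set of annotated points of lane marking $j$, has to be equal to or greater than $\varepsilon$. The values used for $\tau_{acc}$ and $\varepsilon$ were 20 pixels and 0.85, respectively, the same ones used in TuSimple’s benchmark. All three metrics (Acc, FP, and FN) are reported as the average across all images of the per-image average.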
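The per-lane accuracy and the true-positive criterion can be sketched as follows. This is an illustrative sketch, not the official benchmark script: the function names are hypothetical, the strict inequality is an assumption, and the official TuSimple evaluation code may apply further details that are not described in the text.

```python
def poly_x(coeffs, y):
    """Evaluate x = sum_k a_k * y**k (coefficients listed from a_0 upward)."""
    return sum(a * y ** k for k, a in enumerate(coeffs))


def lane_accuracy(coeffs, points, tau_acc=20.0):
    """Fraction of annotated (x, y) points within tau_acc pixels of the polynomial."""
    hits = sum(1 for x_gt, y_gt in points
               if abs(poly_x(coeffs, y_gt) - x_gt) < tau_acc)
    return hits / len(points)


def is_true_positive(coeffs, points, tau_acc=20.0, min_acc=0.85):
    """A prediction counts as correct if its accuracy reaches min_acc."""
    return lane_accuracy(coeffs, points, tau_acc) >= min_acc


# Ten annotated points on a straight marking, then copies with one or two
# points displaced by 50 pixels.
coeffs = [100.0, 0.5]
points = [(100.0 + 0.5 * y, float(y)) for y in range(0, 1000, 100)]
shifted_one = [(points[0][0] + 50.0, points[0][1])] + points[1:]
shifted_two = [(x + 50.0, y) for x, y in points[:2]] + points[2:]
```

In this toy setup a prediction that misses one of ten points by 50 pixels still counts as a true positive, while missing two does not, which gives a feel for how much local error the thresholds tolerate.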
Although TuSimple’s metric has been widely used in the literature, it is too permissive w.r.t. local errors. To avoid relying on this metric alone, we also used a metric proposed by Satzoda and Trivedi, who discuss several evaluation metrics of interest to the lane estimation process. The Lane Position Deviation (LPD) was proposed to better capture the accuracy of the model in both the near and far depths of view of the ego-vehicle. It is the error between the prediction and the ground truth for the ego-lane. To determine which lane markings form the ego-lane (given that the dataset labels and our model are agnostic to this definition), we use a simple rule: the lane markings that are closest to the middle of the bottom part of the image are the ones that compose the ego-lane, i.e., one lane marking to the left and another one to the right.
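The ego-lane rule above can be sketched in a few lines. The sketch assumes image coordinates in which the vertical coordinate grows downward, so the bottom row is `image_height - 1`, and it represents each lane marking by polynomial coefficients giving the horizontal position as a function of the row. The function names are hypothetical.

```python
def poly_x(coeffs, y):
    """Evaluate x = sum_k a_k * y**k (coefficients listed from a_0 upward)."""
    return sum(a * y ** k for k, a in enumerate(coeffs))


def select_ego_lane(lanes, image_width, image_height):
    """Return indices (left, right) of the markings nearest the bottom center.

    lanes: list of coefficient lists. A side with no marking yields None.
    """
    center = image_width / 2.0
    y_bottom = image_height - 1  # assumed bottom row of the image
    left = right = None          # each holds (index, x at the bottom row)
    for idx, coeffs in enumerate(lanes):
        x = poly_x(coeffs, y_bottom)
        if x < center:
            if left is None or x > left[1]:
                left = (idx, x)
        else:
            if right is None or x < right[1]:
                right = (idx, x)
    return (left[0] if left else None, right[0] if right else None)
```

For example, with four vertical markings at x = 100, 500, 800, and 1200 in a 1280×720 image, the rule would pick the second and third markings as the ego-lane boundaries.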
In addition to metrics w.r.t. the quality of the predictions, we also report two speed-related metrics: frames-per-second (FPS) and MACs. For reference, roughly speaking, one MAC (multiply-accumulate) is equivalent to 2 FLOPs; MACs were computed using the torchprofile library (https://github.com/mit-han-lab/torchprofile). The frames-per-second provide a concrete assessment of how fast an implementation can run on a modern computer with a recent GPU, whereas MACs provide a more reliable way to compare different methods that might be running on different frameworks and setups. Analyzing the trade-off between computation efficiency and accuracy is also important. In this paper, we provide such an analysis by reporting the MACs of PolyLaneNet variants with different computational requirements in an ablation study.
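To give a feel for what a MAC count measures, the sketch below counts the multiply-accumulate operations of a single standard 2D convolution and converts them to approximate FLOPs. The layer used in the example (a 3×3 convolution with stride 2, 3 input channels, and 32 output channels) is a hypothetical illustration, not a statement about the profiled model, and the function name is an assumption.

```python
def conv2d_macs(c_in, c_out, kernel, h_out, w_out):
    """MACs of a standard (non-grouped) 2D convolution, ignoring bias."""
    return kernel * kernel * c_in * c_out * h_out * w_out


# Hypothetical stride-2 layer applied to two input sizes.
macs_640x360 = conv2d_macs(3, 32, 3, h_out=180, w_out=320)
macs_480x270 = conv2d_macs(3, 32, 3, h_out=135, w_out=240)

flops_640x360 = 2 * macs_640x360  # rule of thumb: one MAC is about two FLOPs
ratio = macs_640x360 / macs_480x270
print(macs_640x360, flops_640x360, round(ratio, 2))
```

Because the count depends only on layer shapes, it is independent of the framework and hardware, and for convolutional layers it scales with the number of output positions, so a smaller input directly lowers the cost.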
IV-D Quantitative Evaluation
State-of-the-art Comparison. The main quantitative experiment for the proposed method is the comparison against state-of-the-art methods under the same evaluation conditions. For that, the proposed method was used to train a model on the union of TuSimple’s training and validation sets, which was then evaluated on its testing set. Four state-of-the-art methods were compared: SCNN, Line-CNN, ENet-SAD, and FastDraw. Besides prediction quality metrics, model speed w.r.t. FPS is also presented. For our model, we also report the MACs.
Polynomial Degree. In most lane marking detection datasets, lane markings with a more accentuated curvature are rare, while straight ones represent the majority of the cases. With this in mind, one might ask: what would be the impact of modeling lane markings with polynomials of lower orders? To help answer this question, our method was evaluated using first- and second-order polynomials, instead of the default third-order polynomials. Furthermore, we also show the permissiveness of the standard TuSimple metric used in the literature by computing upper bounds for polynomials of different orders.
Ablation Study. To investigate the impact of some of the decisions made for the proposed method, an ablation study was carried out, using only TuSimple’s training set for training and the validation set for testing. For the model backbone, ResNet was evaluated in two of its variants: ResNet-34 and ResNet-50. Another variant of EfficientNet was also evaluated, the EfficientNet-b1. Moreover, when training CNNs, in addition to the impact of the backbone, there is a trade-off when using different image input sizes. For example, if a smaller input size is used, the forward pass of the network will be faster, but information may be lost. To measure this trade-off in the proposed method, two additional models were trained with different input sizes (one of them 480×270 pixels). Additionally, three other practical decisions were evaluated: (i) the impact of not sharing the top-y (i.e., having the end of each lane predicted individually); (ii) the use of a pre-trained model, by training from scratch instead of from a model pretrained on ImageNet; and (iii) the impact of using data augmentation, by removing the online data augmentation, which reduces the variability seen by the model at training time.
IV-E Qualitative Evaluation
For qualitative results, an extensive evaluation was carried out. Using the model trained on TuSimple as a starting point, three models were trained: two on ELAS, one with and one without lane marking type classification, and another on LLAMAS. On ELAS, the model was trained for 385 additional epochs (half of a period of the chosen learning rate scheduler, at which point the learning rate is at a minimum). On LLAMAS, the model was trained for 75 additional epochs, an approximation to the number of iterations used on ELAS, as the training set of LLAMAS is around five times larger than that of ELAS. The experiment with lane marking type classification is a straightforward extension of PolyLaneNet in which a category is predicted for each lane, showcasing how easily the model can be extended.
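The relationship between the two fine-tuning budgets can be checked with a short calculation. This sketch assumes the same batch size of 16 images reported for the TuSimple training and assumes that the last partial batch of each epoch is kept; both are assumptions of the example, and the variable names are hypothetical.

```python
import math

elas_train_images = 11_036
llamas_train_images = 58_269
elas_epochs = 385
batch_size = 16  # assumed to match the TuSimple training setup

# Iterations performed on ELAS, and the number of LLAMAS epochs needed
# to reach roughly the same number of iterations.
elas_iters_per_epoch = math.ceil(elas_train_images / batch_size)
llamas_iters_per_epoch = math.ceil(llamas_train_images / batch_size)
elas_iterations = elas_epochs * elas_iters_per_epoch
equivalent_llamas_epochs = elas_iterations / llamas_iters_per_epoch

size_ratio = llamas_train_images / elas_train_images
print(round(size_ratio, 2), round(equivalent_llamas_epochs, 1))
```

Under these assumptions the size ratio is a little above five and the equivalent budget comes out slightly below the 75 epochs that were used, in line with the text describing 75 as an approximation.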
First, we present the results of the comparison with the state of the art. Then, the results of the ablation study are detailed and discussed. Finally, qualitative results are shown.
State-of-the-art Comparison. The state-of-the-art results on the TuSimple dataset are presented in Table I. As evidenced, PolyLaneNet’s results are competitive. Since none of the compared methods provide source code that replicates their respective published results, it is very difficult to investigate situations where the other methods succeed and ours fails. In Figure 2, some qualitative results of PolyLaneNet on TuSimple are shown. It is noticeable that PolyLaneNet’s predictions on parts of the lane marking closer to the camera (where more details can be seen) are very accurate. Nonetheless, on parts of the lane marking closer to the horizon, the predictions are less accurate. We conjecture that this might be a result of a local minimum, caused by the dataset’s imbalance. Since most lane markings in the dataset can be represented fairly well with 1st order polynomials (i.e., lines), the neural network has a bias towards predicting lines, hence the poor performance on lane markings with accentuated curvature.
[Table I: state-of-the-art results on TuSimple. PP = requires post-processing.]
Polynomial Degree. In terms of the polynomial degree used to represent the lane marking, the small difference in accuracy when using lower-order polynomials shows how unbalanced the dataset is. Using 1st order polynomials (i.e., lines) decreased the accuracy by only 0.35 p.p. Although the dataset’s imbalance certainly has an impact on this, another important factor is the metric used by the benchmark to evaluate a model’s performance. The LPD metric, however, is able to better capture the difference between the models trained using 1st order polynomials and the others. This can be further seen in Table III, which shows the maximum performance (i.e., the upper bound) of methods that represent lane markings as polynomials, measured by fitting polynomials on the test data itself. As can be seen, TuSimple’s metric does not punish predictions that are accurate only in parts of the lane marking closer to the car, where the marking looks almost straight in the image (i.e., it can be represented well by 1st order polynomials), since the thresholds may hide those mistakes. Meanwhile, the LPD metric clearly distinguishes the upper bounds, showing a clear difference even between the 4th and 5th degrees, for which TuSimple’s metrics are almost identical.
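The permissiveness argument can be illustrated on synthetic data. The sketch below fits polynomials of degree 1 and 2 to a gently curved, synthetic lane marking by least squares (via normal equations solved with Gaussian elimination) and compares a thresholded accuracy with a mean absolute deviation. The synthetic curve, the normalized vertical coordinate, and the use of mean absolute deviation as a stand-in for a stricter metric are all assumptions of this example; it does not implement the LPD definition.

```python
def poly_x(coeffs, y):
    """Evaluate x = sum_k a_k * y**k (coefficients listed from a_0 upward)."""
    return sum(a * y ** k for k, a in enumerate(coeffs))


def fit_polynomial(points, degree):
    """Least-squares fit of x as a polynomial in y using normal equations."""
    n = degree + 1
    A = [[sum(y ** (r + c) for _, y in points) for c in range(n)] for r in range(n)]
    b = [sum(x * y ** r for x, y in points) for r in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(A[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (b[r] - s) / A[r][r]
    return coeffs


def accuracy(coeffs, points, tau=20.0):
    return sum(abs(poly_x(coeffs, y) - x) < tau for x, y in points) / len(points)


def mean_abs_dev(coeffs, points):
    return sum(abs(poly_x(coeffs, y) - x) for x, y in points) / len(points)


# Synthetic marking: x in pixels, y normalized to [0, 1] (assumed for conditioning).
ys = [i / 19 for i in range(20)]
points = [(300.0 + 200.0 * y + 60.0 * y * y, y) for y in ys]
line, quad = fit_polynomial(points, 1), fit_polynomial(points, 2)
print(accuracy(line, points), mean_abs_dev(line, points), mean_abs_dev(quad, points))
```

On this toy curve the straight-line fit stays within the 20-pixel threshold at every point, so a thresholded accuracy cannot tell it apart from the exact quadratic fit, whereas the deviation-based measure can.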
[Table: results w.r.t. polynomial degree.]
Ablation Study. The ablation study results are shown in Table IV. EfficientNet-b1 achieved the highest accuracy, followed by EfficientNet-b0 and ResNet-34. Those results suggest that larger networks, such as ResNet-50, may overfit the data. Although EfficientNet-b1 achieved the highest accuracy, we chose not to use it in other experiments, as the accuracy gains were neither significant nor consistent in our experiments. In addition, it is more computationally expensive (i.e., lower FPS, higher MACs, and longer training times). With regard to the input size, reducing it also reduces the accuracy, as expected. In some cases, this accuracy loss may not be significant, but the speed gains may be. For example, using an input size of 480×270 decreased the accuracy by only 0.55 p.p., but the model MACs decreased by 1.82 times.
[Table IV: ablation study results w.r.t. backbone and input size.]
As for the other ablation studies we carried out, one can see that sharing the top-y is slightly better than not sharing it. Moreover, training from a model pretrained on ImageNet seems to have a significant impact on the final result, as shown by the difference of 4.26 p.p. The same happens with data augmentation, as the model trained with more data has a significantly higher accuracy.
Qualitative Evaluation. A sample of the qualitative results on ELAS and LLAMAS is shown in Figure 3. For more extensive results, videos are available (qualitative results on ELAS/LLAMAS: https://drive.google.com/drive/folders/136tuy11n-Q_mAVbsnbjZnMPn-9VBMwf3). The results show that transfer learning works well on PolyLaneNet, since a smaller number of epochs was enough to obtain reasonable results on different datasets. However, in ELAS, there are many lane changes. In those situations, the model’s accuracy decreased significantly. Since the images of those situations have a very different structure (e.g., the car is not heading in the direction of the road), the low number of samples of this situation may not have been enough for the model to learn it.
In this work, a novel method for lane detection based on deep polynomial regression was proposed. The proposed method is simple and efficient, while maintaining competitive accuracy when compared to state-of-the-art methods. Although state-of-the-art methods with slightly higher accuracy exist, most do not provide source code to replicate their results; therefore, deeper investigations of the differences between methods are difficult. Our method, besides being computationally efficient, is publicly available so that future works on lane marking detection have a baseline to start from and to compare against. Furthermore, we have shown problems with the metrics used to evaluate lane marking detection methods. For future works, metrics that can be used across different approaches to lane detection (e.g., segmentation) and that better highlight flaws in lane detection methods can be explored.
- [1] C. Badue, R. Guidolini, R. V. Carneiro, P. Azevedo, V. B. Cardoso, A. Forechi, L. Jesus, R. Berriel, T. Paixão, F. Mutz et al., “Self-driving Cars: A Survey,” arXiv preprint arXiv:1901.04407, 2019.
- [2] L. C. Possatti, R. Guidolini, V. B. Cardoso, R. F. Berriel, T. M. Paixão, C. Badue, A. F. De Souza, and T. Oliveira-Santos, “Traffic light recognition using deep learning and prior maps for autonomous cars,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
- [3] P. Yang, G. Zhang, L. Wang, L. Xu, Q. Deng, and M.-H. Yang, “A part-aware multi-scale fully convolutional network for pedestrian detection,” IEEE Transactions on Intelligent Transportation Systems, 2020.
- [4] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, and K. Dietmayer, “Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges,” IEEE Transactions on Intelligent Transportation Systems, 2020.
- [5] J. C. McCall and M. M. Trivedi, “Video Based Lane Estimation and Tracking for Driver Assistance: Survey, System, and Evaluation,” IEEE Transactions on Intelligent Transportation Systems, vol. 7, no. 1, pp. 20–37, 2006.
- [6] R. F. Berriel, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, “Ego-Lane Analysis System (ELAS): Dataset and Algorithms,” Image and Vision Computing, vol. 68, pp. 64–75, 2017.
- [7] X. Pan, J. Shi, P. Luo, X. Wang, and X. Tang, “Spatial As Deep: Spatial CNN for Traffic Scene Understanding,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [8] X. Li, J. Li, X. Hu, and J. Yang, “Line-CNN: End-to-End Traffic Line Detection With Line Proposal Unit,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 1, pp. 248–258, 2019.
- [9] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, “Learning Lightweight Lane Detection CNNs by Self Attention Distillation,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 1013–1021.
- [10] K. Kluge and S. Lakshmanan, “A deformable-template approach to lane detection,” in Proceedings of the Intelligent Vehicles Symposium. IEEE, 1995, pp. 54–59.
- [11] K.-Y. Chiu and S.-F. Lin, “Lane Detection using Color-based Segmentation,” in Proceedings Intelligent Vehicles Symposium. IEEE, 2005, pp. 706–711.
- [12] C. R. Jung and C. R. Kelber, “Lane Following and Lane Departure Using a Linear-Parabolic Model,” Image and Vision Computing, vol. 23, no. 13, pp. 1192–1202, 2005.
- [13] R. F. Berriel, E. de Aguiar, V. V. de Souza Filho, and T. Oliveira-Santos, “A Particle Filter-based Lane Marker Tracking Approach Using a Cubic Spline Model,” in 28th SIBGRAPI Conference on Graphics, Patterns and Images. IEEE, 2015, pp. 149–156.
- [14] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, F. Mujica, A. Coates, and A. Y. Ng, “An empirical evaluation of deep learning on highway driving,” arXiv preprint arXiv:1504.01716, 2015.
- [15] A. Gurghian, T. Koduri, S. V. Bailur, K. J. Carey, and V. N. Murali, “DeepLanes: End-To-End Lane Position Estimation using Deep Neural Networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016, pp. 38–45.
- [16] TuSimple. TuSimple Benchmark. [Online]. Available: https://github.com/TuSimple/tusimple-benchmark
- [17] J. Philion, “FastDraw: Addressing the Long Tail of Lane Detection by Adapting a Sequential Prediction Network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11 582–11 591.
- [18] K. Behrendt and R. Soussan, “Unsupervised labeled lane marker dataset generation using maps,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
- [19] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the 36th International Conference on Machine Learning (ICML), 2019, pp. 6105–6114.
- [20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
- [21] R. K. Satzoda and M. M. Trivedi, “On Performance Evaluation Metrics for Lane Estimation,” in International Conference on Pattern Recognition (ICPR). IEEE, 2014, pp. 2625–2630.
- [22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.