Lane and road markings are critical elements in traffic scenes. The lane lines or road signs such as arrows can provide valuable information for planning the vehicle trajectory or controlling its driving behavior. Thus, accurate lane marking detection is of great importance for self-driving cars.
Current lane marking detection methods mostly utilize segmentation techniques based on fully convolutional neural networks (FCNs). These segmentation networks rely on local features which are extracted from the raw RGB image and mapped into semantic spaces for pixel-level classification. However, such an architecture often suffers from accuracy degradation for lane and road markings far away from the ego-vehicle, because distant markings occupy only a small number of pixels in the image and their features become inconsistent across varying distances and perspectives, as shown in Fig. 1. This also negatively affects the performance of autonomous driving, since both distant and close lane marking information is important for the control and planning tasks.
An intuitive solution to the above problem is to transform the original image to a bird’s-eye view (BEV) using Inverse Perspective Mapping (IPM). In principle, this removes the perspective distortion in the image and solves the problem of inconsistent scales of lane and road markings at different distances. However, the IPM is typically implemented by interpolation, which severely reduces the resolution of the distant road surface in the image and creates unnatural blurring and stretching (Fig. 2), with a negative impact on the final detection accuracy. To tackle this issue, we adopt the encoder-decoder architecture of Fully Convolutional Networks and leverage the idea of Spatial Transformer Networks to build a semantic segmentation network. As shown in Fig. 3, fully convolutional layers are interleaved with a series of differentiable homographic transform layers, called "Perspective Transformer Layers" (PTLs), which transform the multi-channel feature maps from the original view to the bird’s-eye view during the encoding process. Afterwards, the feature maps are transformed back to the original perspective in the decoding process, where subsequent convolutional layers refine each interpolated feature map. Therefore, the network can still use labels in the original view for end-to-end training.
In this work, our contributions can be summarized as follows:
We propose a lane marking detection network based on the FCN, which integrates novel PTLs to reduce the perspective distortion at a distance.
We build a mathematical model to derive the parameters of consecutive PTLs, enabling a step-wise mutual conversion between the original view and the bird’s-eye view.
II Related Works
Lane marking detection has been intensively explored and recent progresses have mainly focused on the semantic segmentation-based and instance segmentation-based methods.
II-A Lane Marking Detection by Semantic Segmentation
The emerging lane and road marking datasets have enabled the introduction of deep semantic segmentation methods into the lane marking detection task [9, 10]. Liu et al. proposed both a road marking dataset and a segmentation network using ResNet with pyramid pooling. Lee et al. proposed a unified, end-to-end trainable multi-task network that jointly handles lane marking detection and road marking recognition under adverse weather conditions, guided by a vanishing point. Zhang et al. proposed a segmentation-by-detection method for road marking extraction that delivers outstanding cross-dataset performance; in this method, a lightweight network is dedicatedly designed for road marking detection, but the segmentation itself is mainly based on conventional image morphological algorithms.
II-B Lane Marking Detection by Instance Segmentation
Semantic segmentation is essentially a pixel-level classification problem: it can neither distinguish different instances within the same category nor interpret separated parts of the same marking (dashed lines, zebra lines, etc.) as a unit. Therefore, researchers’ attention has gradually shifted to the problem of instance segmentation.
Pan et al. proposed the Spatial CNN (SCNN), which generalizes traditional spatial convolutions to slice-wise convolutions within feature maps, thus enabling message passing between pixels across rows and columns of a layer. This is particularly suitable for long continuous structures or large objects with strong spatial relationships but few appearance cues, such as traffic lanes. Hsu et al. proposed a novel learning objective to train a deep neural network to perform end-to-end pixel clustering; they applied this approach to instance segmentation, using the pairwise relationships between pixels for supervision. Neven et al. went beyond the limitation of modelling a pre-defined number of lanes and proposed to cast lane detection as an instance segmentation problem, in which each lane forms its own instance and the network can be trained end-to-end. To parameterize the segmented lane instances before lane fitting, they further proposed to apply a learned perspective transform that is conditioned on the image, called H-Net.
II-C Perspective Transform in CNNs
The accuracy of existing lane marking detection methods often degrades for distant markings due to the inconsistent scales caused by perspective projection. To compensate for the perspective distortion, a spatial transform, such as the IPM, should be involved. A typical work implementing such a transform with a neural network is the Spatial Transformer Network. It introduces a learnable module, i.e., the Spatial Transformer, which explicitly allows spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, enabling active spatial transformation of feature maps.
The most similar work to ours is that of Bruls et al., in which an adversarial learning approach is proposed for generating an improved IPM from a single image using the STN. The generated BEV images contain sharper features (e.g., lane and road markings) than those produced by the traditional IPM. The main difference between this work and ours is that they take a ground-truth BEV image (obtained by visual odometry) for supervision and train their network with a GAN loss; their target is to generate a high-resolution IPM, while ours is to improve the segmentation accuracy. Besides, they apply STN layers at the bottleneck of the encoder-decoder network, whereas our PTLs are interleaved with the convolutional and downsampling layers, so that subsequent convolutional layers can refine each interpolated feature map.
III Proposed Method
In this work, we boost the performance of lane marking detection by inserting differentiable PTLs into a standard encoder-decoder architecture. One challenge in designing such transformer layers lies in dividing and distributing the integral transform into several even steps; another is determining a proper cropping range for the intermediate views. In this section, we first describe the improved backbone in section III-A. Then, we address how to apply the transformer layers, as well as how to solve the above difficulties, in section III-B. Finally, we illustrate the deployment of the backbone in both semantic and instance segmentation contexts, with details about the detection heads and loss functions dedicated to these tasks, in section III-C.
III-A Network Structure
The proposed network structure, in which PTLs interleave with the convolutional and down-sampling layers, is shown in Fig. 3. We refer to our network as TPSeg. In this network, images pass through the encoder and are down-sampled by five stride-2 down-sampling operations, while the feature map is gradually warped into a pseudo bird’s-eye view. Afterwards, the decoder reverts the previous transforms by up-sampling and back-projecting the feature map to its original size and perspective, while keeping the accumulated high-level semantic information of lane and road markings.
Mathematically, a PTL is nothing more than a homographic transformation of the coordinates followed by a bi-linear sampling of the feature map. The sampling procedure does not affect the overall differentiability, so that a network with PTLs can be trained in an end-to-end manner. For a better understanding, we use an RGB image instead of feature maps and give a qualitative visual description of the results yielded by PTLs in Fig. 4.
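To make this concrete, the following PyTorch sketch implements a PTL with a fixed homography (the class name and the fixed-matrix setup are ours for illustration; in the actual network the matrices are derived as in Sec. III-B):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerspectiveTransformerLayer(nn.Module):
    """Warps a feature map by a fixed 3x3 homography H that maps target
    pixel coordinates to source pixel coordinates, using differentiable
    bilinear sampling (gradients flow through grid_sample)."""
    def __init__(self, H, out_size):
        super().__init__()
        self.register_buffer("H", torch.as_tensor(H, dtype=torch.float32))
        self.out_size = out_size  # (height, width) of the target view

    def forward(self, x):
        n, _, h_in, w_in = x.shape
        h_out, w_out = self.out_size
        ys, xs = torch.meshgrid(
            torch.arange(h_out, dtype=torch.float32, device=x.device),
            torch.arange(w_out, dtype=torch.float32, device=x.device),
            indexing="ij")
        pts = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
        src = pts @ self.H.T            # project target pixels into the source view
        src = src[:, :2] / src[:, 2:3]  # dehomogenize
        # Normalize pixel coordinates to [-1, 1] as required by grid_sample
        gx = 2.0 * src[:, 0] / (w_in - 1) - 1.0
        gy = 2.0 * src[:, 1] / (h_in - 1) - 1.0
        grid = torch.stack([gx, gy], dim=-1).reshape(1, h_out, w_out, 2)
        return F.grid_sample(x, grid.expand(n, -1, -1, -1),
                             mode="bilinear", align_corners=True)
```

With an identity homography the layer reproduces its input, which is a convenient sanity check for the coordinate normalization.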
Since feature maps in the middle of the convolutional neural network have perspective transform relationships with the input image, this transform is equivalent to warping the input image to a BEV for training and detection, thus solving the problem of inconsistent scales of lane and road markings due to different distances. Meanwhile, the subsequent refinement reduces blur and artifacts caused by interpolation. Similar to the FCN, we also add skip-connections to merge feature maps from the up- and down-sampling layers of the same size. This can compensate for the information loss during the down-sampling, resulting in clear boundaries for detected lane and road markings.
III-B Consecutive Perspective Mapping
In order to map the front-view image captured by the vehicle-mounted camera into a bird’s-eye view smoothly, we adopt an approach differing from the standard IPM method. Here we decompose the integral transform $H_{0 \to n}$ into a series of shortest-path consecutive transforms $H_{i,i+1}$ ($H_i$ for short) that project view $i$ into view $i+1$. This procedure is interpreted as

$$H_{i,i+1} = K_{i+1}\left(R_{i,i+1} - \frac{t_{i,i+1}\, n_i^{\top}}{d_i}\right)K_i^{-1}, \tag{1}$$

where $R_{i,i+1}$ ($R_i$ for short) is the rotation matrix by which virtual camera $i+1$ is rotated in relation to virtual camera $i$; $t_{i,i+1}$ is the translation vector from camera $i$ to camera $i+1$; $n_i$ and $d_i$ are the normal vector of the ground plane and the distance to the plane in the frame of camera $i$, respectively; and $K_i$ and $K_{i+1}$ are the cameras’ intrinsic parameter matrices.
However, to control the transform process, the values of the internal parameters, i.e., $R_i$, $t_i$, $n_i$, $d_i$ and $K_i$, should be selected for each $H_i$ by trial and error, which is a tedious job. To simplify this process, we use a pure-rotation virtual camera model to eliminate $t_i$, $n_i$ and $d_i$, and use a Key-Point Bounding-Box Trick to estimate $K_i$ for optimal viewports of intermediate feature maps. Whereas the traditional IPM uses at least 4 pairs of pre-calibrated correspondences on each view to estimate the integral homography directly, we estimate the integral rotation by the horizon line specified on the image, which can be obtained by horizon line detection models, e.g., the HLW. By representing the rotation in the axis-angle form, it is much easier to divide the rotation into sections by dividing the angle while keeping the axis direction unchanged. In this way, all internal parameters of each $H_i$ are determined. Details of the above procedure are given as follows.
III-B1 Pure Rotation Virtual Cameras
It can be proven that a translated camera with unchanged intrinsic matrix can produce the same image as a fixed camera with an accordingly modified intrinsic matrix. Thus, the consecutive perspective transform is modeled as synthesizing the ground-plane image captured by a purely rotating camera, and (1) is simplified as

$$H_{i,i+1} = K_{i+1}\, R_{i,i+1}\, K_i^{-1}, \tag{2}$$

and only the rotation matrix should be decomposed, as

$$R_{0 \to n} = R_{n-1} \cdots R_1 R_0. \tag{3}$$
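Under this pure-rotation model, computing each transform reduces to two matrix products. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def pure_rotation_homography(K_src, K_dst, R):
    """H = K_dst R K_src^{-1}: the image warp induced by rotating the
    camera in place by R (no translation, so the plane term of the full
    homography vanishes)."""
    return K_dst @ R @ np.linalg.inv(K_src)
```

Because the warp is linear in the rotation, composing two such homographies with the same intrinsics is equivalent to a single homography built from the composed rotation.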
III-B2 Estimating the Integral Extrinsic Rotation by the Horizon Line
As extrinsic matrices with respect to the ground plane are not provided in the TuSimple and ApolloScape datasets, we roughly estimate the integral rotation by the horizon line. Given two horizon points in the camera coordinates, $p_1$ and $p_2$, the normal vector of the ground plane (facing to the ground) is calculated by a cross product, i.e.,

$$n = \frac{p_1 \times p_2}{\lVert p_1 \times p_2 \rVert}. \tag{4}$$

In order to rotate the camera to face the ground, its $z$-axis should be rotated to align with the normal vector $n$. Hence, the rotation in axis-angle form is calculated as

$$\theta = \arccos(e_z \cdot n), \qquad \omega = \frac{e_z \times n}{\lVert e_z \times n \rVert}, \tag{5}$$

where $e_z$ is a unit vector on the $z$-axis.
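This estimation can be sketched in a few lines of numpy (the function names and the y-down sign convention for "facing the ground" are our assumptions):

```python
import numpy as np

def ground_normal_from_horizon(K, px1, px2):
    """Back-project two horizon pixels to viewing rays; their cross
    product gives the ground-plane normal, normalized and oriented to
    face the ground (y-down camera convention assumed)."""
    Kinv = np.linalg.inv(K)
    p1 = Kinv @ np.array([px1[0], px1[1], 1.0])
    p2 = Kinv @ np.array([px2[0], px2[1], 1.0])
    n = np.cross(p1, p2)
    n /= np.linalg.norm(n)
    return n if n[1] > 0 else -n

def rotation_to_face_ground(n):
    """Axis-angle rotation aligning the camera z-axis with the normal n."""
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(z, n)
    axis /= np.linalg.norm(axis)
    angle = np.arccos(np.clip(z @ n, -1.0, 1.0))
    return axis, angle
```

For a level horizon through the principal point this yields a 90-degree rotation about the camera x-axis, as expected.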
III-B3 Decomposing the Extrinsic Rotation
Here we use the axis-angle representation for decomposing the rotation. We simply divide the integral angle $\theta$ into $n$ even parts $\theta / n$, and then convert each to the corresponding rotation matrix $R_i$.
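The decomposition can be sketched as follows, using the Rodrigues formula for the axis-angle-to-matrix conversion (function names are ours):

```python
import numpy as np

def rodrigues(axis, angle):
    """Axis-angle to rotation matrix via Rodrigues' formula."""
    kx, ky, kz = axis
    Kx = np.array([[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]])
    return np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)

def split_rotation(axis, angle, n):
    """The integral rotation expressed as n equal sub-rotations about
    the same axis; composing all of them recovers the full rotation."""
    return [rodrigues(axis, angle / n) for _ in range(n)]
```

Since the sub-rotations share the axis, their order of composition does not matter and the product exactly recovers the integral rotation.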
III-B4 Optimal Viewports by Key-Point Bounding Boxes
While conducting the IPM, image pixels at the edge often need to be cropped to prevent the target view from becoming too large. In order to preserve as many informative pixels as possible, we roughly annotate the ground region by a set of border points in the front view. These points are projected into the new view during each perspective transform, and we use their bounding box in the new view to determine the minimal viewport that does not crop any projected key point. Thus, given a desirable target view width $w$, the corresponding intrinsic matrix and target view height $h$ are determined, as shown in Algorithm 1.
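A sketch of this key-point bounding-box trick, under the assumption that the viewport is realized as an extra scale-and-shift applied on top of the homography (function and variable names are ours, not the paper's Algorithm 1 verbatim):

```python
import numpy as np

def viewport_from_keypoints(H, border_pts, target_width):
    """Project annotated ground-border points with homography H, take
    their bounding box, and derive a scale/shift adjustment so a target
    view of width `target_width` contains every projected point."""
    pts = np.hstack([border_pts, np.ones((len(border_pts), 1))])  # (N, 3)
    proj = (H @ pts.T).T
    proj = proj[:, :2] / proj[:, 2:3]          # dehomogenize
    x_min, y_min = proj.min(axis=0)
    x_max, y_max = proj.max(axis=0)
    scale = target_width / (x_max - x_min)
    target_height = int(np.ceil(scale * (y_max - y_min)))
    # Adjustment applied on top of H: scale, then shift the box to the origin
    K_adjust = np.array([[scale, 0.0, -scale * x_min],
                         [0.0, scale, -scale * y_min],
                         [0.0, 0.0, 1.0]])
    return K_adjust, target_height
```

The adjusted transform `K_adjust @ H` then maps the annotated ground region exactly into a `target_width x target_height` viewport.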
III-C Segmentation Heads
III-C1 Semantic Segmentation
In this work, we adopt an FCN-like network. By representing label classes as one-hot vectors, we predict the logits of each class at each pixel location. Then, we use the classic cross-entropy loss function to train this semantic segmentation branch.
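With integer label maps, the per-pixel cross-entropy can be computed directly in PyTorch (the class count and spatial size below are illustrative only):

```python
import torch
import torch.nn as nn

# Per-pixel cross-entropy over the class logits predicted by the
# semantic head; CrossEntropyLoss averages over all pixel locations.
num_classes = 8
logits = torch.randn(2, num_classes, 64, 64)          # (N, C, H, W)
labels = torch.randint(0, num_classes, (2, 64, 64))   # integer class map
loss = nn.CrossEntropyLoss()(logits, labels)
```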
III-C2 Instance Segmentation
Semantic segmentation can neither distinguish different instances within the same category nor interpret separated parts of the same marking (dashed lines, zebra lines, etc.) as a unit. In order to solve this issue, we follow the work of LaneNet and interpret the lane detection problem as an instance segmentation task. The network contains two branches: the semantic branch outputs a binary mask, while the instance embedding branch outputs an N-dimensional embedding vector for each pixel, serving to disentangle the lane pixels identified by the semantic branch. Here we also use a one-shot method based on distance metric learning. During training, pixels from the same instance are "pulled" close to each other, and those from different instances are "pushed" away from each other by the loss function. During inference, the unmasked pixels are clustered in the embedding space to output the final result. For details of the loss function, please refer to De Brabandere et al.
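A simplified single-image sketch of such a pull/push objective is given below (the margins and the omitted regularization term may differ from the exact loss used; the function name is ours):

```python
import torch

def discriminative_loss(emb, labels, delta_v=0.5, delta_d=3.0):
    """Simplified pull/push loss for one image. emb: (D, H, W) embedding
    map; labels: (H, W) with 0 = background, positive ints = instances."""
    ids = [i for i in labels.unique().tolist() if i != 0]
    means, var_loss = [], emb.new_zeros(())
    for i in ids:
        e = emb[:, labels == i]                    # (D, n_pixels)
        mu = e.mean(dim=1)
        means.append(mu)
        d = (e - mu[:, None]).norm(dim=0)          # pull toward the mean
        var_loss = var_loss + ((d - delta_v).clamp(min=0) ** 2).mean()
    var_loss = var_loss / max(len(ids), 1)
    dist_loss, pairs = emb.new_zeros(()), 0
    for a in range(len(means)):
        for b in range(a + 1, len(means)):         # push means apart
            d = (means[a] - means[b]).norm()
            dist_loss = dist_loss + (2 * delta_d - d).clamp(min=0) ** 2
            pairs += 1
    return var_loss + dist_loss / max(pairs, 1)
```

When the instances form tight, well-separated clusters in the embedding space, both terms hinge to zero, which is exactly the state the clustering step at inference time relies on.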
IV Experiments

We evaluate the proposed method on the TuSimple benchmark and the ApolloScape Lane Segmentation dataset respectively for the instance segmentation and semantic segmentation tasks. Our network is implemented with the PyTorch framework.
IV-A TuSimple Benchmark
The TuSimple Benchmark is a dedicated dataset for lane detection and consists of 3626 training and 2782 testing images captured under good and medium weather conditions. The image resolution of this dataset is 1280x720 pixels. The annotation specifies the $x$-positions of the lane points at a number of discretized $y$-positions.
The detection accuracy is calculated as the average ratio of correctly detected points per image:

$$\mathrm{acc} = \frac{\sum_{im} C_{im}}{\sum_{im} S_{im}},$$

where $C_{im}$ denotes the number of correct points and $S_{im}$ is the number of groundtruth points. A point is regarded as correctly detected when the difference between the groundtruth and the predicted point is smaller than a predefined threshold. Together with the accuracy, the false positive and false negative scores are calculated by

$$FP = \frac{F_{pred}}{N_{pred}}, \qquad FN = \frac{M_{pred}}{N_{gt}},$$

where $F_{pred}$ denotes the number of mispredicted lanes, $N_{pred}$ indicates the number of predicted lanes, $M_{pred}$ is the number of missed groundtruth lanes and $N_{gt}$ represents the number of all groundtruth lanes.
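These metrics amount to a few ratios of counts; a small sketch (variable names are ours):

```python
def tusimple_metrics(correct_pts, gt_pts, wrong_lanes, pred_lanes,
                     missed_lanes, gt_lanes):
    """TuSimple-style accuracy / FP / FN rates. correct_pts and gt_pts
    are per-image point counts; the lane counts are totals over the
    whole test set."""
    acc = sum(correct_pts) / sum(gt_pts)   # fraction of correct points
    fp = wrong_lanes / pred_lanes          # mispredicted / predicted lanes
    fn = missed_lanes / gt_lanes           # missed / groundtruth lanes
    return acc, fp, fn
```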
IV-A3 Training Details
Here we regard the lane detection as an instance segmentation task and train the instance segmentation network as shown in Fig. 3. Considering that the two classes (lane/background) are highly unbalanced, we apply the bounded inverse class weighting. During the training process we use the Adam optimizer, with a weight decay of 0.0005, a momentum of 0.95, a learning rate of 0.00004, and a batch size of 2. When the accuracy does not improve for 60 epochs, the learning rate is reduced to 10% of its previous value. The model converges after 220 epochs.
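These hyper-parameters can be sketched in PyTorch as follows (we read the reported momentum as Adam's beta1, and the single convolution stands in for the full model):

```python
import torch

# Optimizer matching the reported hyper-parameters; the model here is a
# placeholder module, not the TPSeg network itself.
model = torch.nn.Conv2d(3, 2, kernel_size=3)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-5,
                             weight_decay=5e-4, betas=(0.95, 0.999))
# Drop the learning rate to 10% after 60 epochs without accuracy gain
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=60)
```

After each epoch, `scheduler.step(val_accuracy)` is called with the validation accuracy so that stagnation triggers the decay.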
IV-A4 Evaluation Results
In comparison with other state-of-the-art methods, we show the test results in Table I, which indicate that our detection accuracy is on par with the best published results. It is worth mentioning that all evaluation results above are in strict accordance with the metric defined by TuSimple. However, in our method the feature maps are warped into a bird’s-eye view of the ground, which forces the part of the image above the horizon to be ignored by our method and leads to a slight decrease of the results. In order to make a fair comparison, we re-evaluated only those samples below the horizon, and ignored null samples in the annotation. The new evaluation result is shown in the rightmost column of Table I. We also plot the accuracy versus different distances from the ego-vehicle in the line charts.
| Method | Acc (%) | FP | FN | Extra data | Acc below horizon |
|---|---|---|---|---|---|
| Xingang Pan et al. | 96.53 | 0.0617 | 0.0180 | yes | N/A |
| Yen-Chang Hsu et al. | 96.50 | 0.0851 | 0.0269 | no | N/A |
| Davy Neven et al. | 96.40 | 0.0780 | 0.0244 | no | N/A |
Fig. 5 (a) shows the accuracy as a function of the pixel distance to the image bottom, and Fig. 5 (b) shows the accuracy as a function of the real distance to the ego-vehicle. Both charts imply that our method improves the detection accuracy of lane and road markings at longer distances. A qualitative comparison is shown in Fig. 6.
IV-B ApolloScape Benchmark
ApolloScape is a large-scale dataset containing seven branches for different tasks in the field of autonomous driving. Among them, the Lane Segmentation branch contains a diverse set of stereo video sequences recorded in street scenes of different cities, with high-quality pixel-level annotations of more than 110,000 frames. Images in the ApolloScape dataset have a resolution of 3384x2710 pixels. The annotation covers 35 kinds of lane and road markings from daily traffic scenarios, including but not limited to lanes, turning arrows, stop lines and zebra crossings. To the best of the authors’ knowledge, no related works have been trained on the ApolloScape Lane Segmentation dataset; therefore, we only show ablation results of our own method.
The evaluation follows the recommendation of ApolloScape, which uses the mean IoU (mIoU) as the evaluation metric, just like in the Cityscapes benchmark. For each class $c$, given the predicted mask $P_{ic}$ and the groundtruth $G_{ic}$ for image $i$ and class $c$, the evaluation metric is defined as:

$$\mathrm{IoU}_c = \frac{\sum_i \lvert P_{ic} \cap G_{ic} \rvert}{\sum_i \lvert P_{ic} \cup G_{ic} \rvert}, \qquad \mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{IoU}_c.$$
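The metric can be computed directly from integer label maps (a minimal sketch; the function name is ours, and classes absent from both prediction and groundtruth are skipped):

```python
import numpy as np

def mean_iou(pred, gt, classes):
    """Dataset-level mean IoU over the given class ids; pred and gt are
    integer label maps (or stacks of them) of identical shape."""
    ious = []
    for c in classes:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                 # skip classes absent everywhere
            ious.append(inter / union)
    return float(np.mean(ious))
```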
IV-B3 Training Details
Since ApolloScape only provides pixel-level semantic annotations instead of instance information, we train the semantic segmentation network as shown in Fig. 3. Here we split 6288 frames out of the training set for validation. During the training process we use the Adam optimizer, with a weight decay of 0.0005, a momentum of 0.95, a learning rate of 0.00004, and a batch size of 2. The model converges after 25 epochs.
IV-B4 Evaluation Results
For a fair evaluation, we again consider only the image part below the horizon. Fig. 7 shows the mean-IoU accuracy at different distances. Table II shows the mIoU value and the IoU values of some common types of lane and road markings; we ignore the remaining classes whose frequency is less than 0.001.
| Class | IoU (baseline) | IoU (TPSeg) |
|---|---|---|
| thru & left turn | 0.768 | 0.800 |
| thru & right turn | 0.824 | 0.808 |
| yellow double solid | 0.886 | 0.893 |
According to the experimental results, our method can effectively improve the detection accuracy of road markings at greater distances, especially for lane and road markings with rich structural features such as turning arrows. A qualitative comparison is shown in Fig. 8.
V Conclusion

In this paper, we introduced a segmentation network architecture improved by consecutive homographic transforms for ground-plane road marking detection. The parameters of the consecutive transforms are derived in closed form by a pure-rotation camera model and a key-point bounding-box trick. The proposed method is shown to be beneficial for distant lane and road marking detection. In future research, we plan to incorporate an online scheme of extrinsic estimation into this structure. Handling non-flat ground surfaces, for which ground normal vectors and horizon lines are not well defined, is another interesting direction.
This work is supported by the National Key Research and Development Program of China (No. 2018YFB0105103, No. 2017YFA0603104), the National Natural Science Foundation of China (No. U1764261, No. 41801335, No. 41871370), the Natural Science Foundation of Shanghai (No. kz170020173571, No. 16DZ1100701) and the Fundamental Research Funds for the Central Universities (No. 22120180095).
-  Seokju Lee, Junsik Kim, Jae Shin Yoon, Seunghak Shin, Oleksandr Bailo, Namil Kim, Tae-Hee Lee, Hyun Seok Hong, Seung-Hoon Han, and In So Kweon. Vpgnet: Vanishing point guided network for lane and road marking detection and recognition. arXiv:1710.06288 [cs], Oct 2017. arXiv: 1710.06288.
-  Davy Neven, Bert De Brabandere, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. Towards end-to-end lane detection: an instance segmentation approach. arXiv:1802.05591 [cs], Feb 2018. arXiv: 1802.05591.
-  Xingang Pan, Jianping Shi, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Spatial as deep: Spatial CNN for traffic scene understanding. In AAAI Conference on Artificial Intelligence, 2018.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
-  Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems, 2015.
-  Tusimple lane detection benchmark. http://benchmark.tusimple.ai/.
-  Xinyu Huang, Xinjing Cheng, Qichuan Geng, Binbin Cao, Dingfu Zhou, Peng Wang, Yuanqing Lin, and Ruigang Yang. The apolloscape dataset for autonomous driving. arXiv: 1803.06184, 2018.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.
-  Xiaolong Liu, Zhidong Deng, Hongchao Lu, and Lele Cao. Benchmark for road marking detection: Dataset specification and performance baseline. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pages 1–6. IEEE, 2017.
-  Y. Wu, T. Yang, J. Zhao, L. Guan, and W. Jiang. Vh-hfcn based parking slot and lane markings segmentation on panoramic surround view. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1767–1772, June 2018.
-  Weiwei Zhang, Zeyang Mi, Yaocheng Zheng, Qiaoming Gao, and Wenjing Li. Road marking segmentation based on siamese attention module and maximum stable external region. IEEE Access, 7:143710–143720, 2019.
-  Yen-Chang Hsu, Zheng Xu, Zsolt Kira, and Jiawei Huang. Learning to cluster for proposal-free instance segmentation. arXiv:1803.06459 [cs], Mar 2018. arXiv: 1803.06459.
-  Tom Bruls, Horia Porav, Lars Kunze, and Paul Newman. The right (angled) perspective: Improving the understanding of road scenes using boosted inverse perspective mapping. In 2019 IEEE Intelligent Vehicles Symposium (IV), page 302–309, Jun 2019.
-  Scott Workman, Menghua Zhai, and Nathan Jacobs. Horizon lines in the wild. In Procedings of the British Machine Vision Conference 2016, pages 20.1–20.12. British Machine Vision Association, 2016.
-  Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic instance segmentation with a discriminative loss function, 2017.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
-  Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation, 2016.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.