Advanced Driver Assistance Systems (ADAS), a key technology for autonomous driving, assist drivers in a variety of driving scenarios owing to deep learning. For ADAS, lane detection is an essential technology that allows vehicles to follow lanes stably. However, lane detection tasks, which rely on visual cues from cameras, remain challenging owing to severe occlusions, extreme changes in lighting conditions, and poor pavement conditions. Even in such difficult driving scenarios, humans can sensibly determine the positions of lanes by recognizing the positional relationship between the vehicles and the surrounding environment. This remains a difficult task in image-based deep learning.
The most widely used lane detection approach in image-based deep learning is segmentation-based lane detection [neven2018towards, pan2017spatial, hou2019learning, ghafoorian2018gan, lee2017vpgnet, liu2020lane, chang2019multi, lo2019multi, chen2018efficient]. These works learn in an end-to-end manner whether each pixel of the image represents a lane. However, it is very difficult to segment lane areas that are not visible owing to occlusion. To solve this problem, the network must capture the scene context with sparse supervision. Therefore, some works [pan2017spatial, hou2019learning] also introduce message passing or attention distillation. In [ghafoorian2018gan], adversarial learning was applied to generate lanes similar to the real ones. These approaches can capture sparse supervision or sharpen blurry lanes. However, segmenting every pixel to detect lanes can be computationally inefficient.
To simplify the lane detection process and increase efficiency, some works [qin2020ultra, yoo2020end, chougule2018reliable] consider the problem of lane detection a relatively simple task and adopt the classification method. In [qin2020ultra], a very fast speed was achieved by dividing the image into a grid of a certain size and determining the position of the lane with row-wise classification. However, these methods do not represent lanes accurately, nor do they detect relatively large numbers of lanes.
To address the shortcomings of the semantic segmentation and classification methods described earlier, we propose a novel self-attention module called the Expanded Self Attention (ESA) module. The module is designed for segmentation-based lane detection and can be attached to any encoder-decoder-based model. Moreover, it does not increase the inference time. To make the model robust to occlusion and difficult lighting conditions, the ESA module aims to extract important global contextual information by predicting the occluded location in the image. Inspired by the simple geometry of lanes, ESA modules are divided into HESA (Horizontal Expanded Self Attention) and VESA (Vertical Expanded Self Attention). HESA and VESA extract the location of the occlusion by predicting the confidence of the lane along the vertical and horizontal directions, respectively. Since we do not provide additional supervisory signals for occlusion, having the ESA module predict the occlusion location strongly helps the model extract global contextual information. Details of the ESA module are presented in Section 3.2.
Our method is tested on three popular datasets (TuSimple, CULane, and BDD100K) containing a variety of challenging driving scenarios. Our approach achieves state-of-the-art performance on the CULane and BDD100K datasets; on CULane in particular, it surpasses the previous methods with an F1 score of 74.2. We confirm the effectiveness of the ESA module in various comparative experiments and demonstrate that our method is robust under occlusion and extreme lighting conditions. In particular, the results in Figure 1 show that our module achieves impressive lane detection performance in various challenging driving scenarios.
Our main contributions can be summarized as follows:
We propose a new Expanded Self Attention (ESA) module. The ESA module remarkably improves the segmentation-based lane detection performance by extracting global contextual information. Our module can be attached to any encoder-decoder-based model and does not increase inference time.
Inspired by the simple lane geometry, we divide the ESA module into HESA and VESA. Each module extracts the occlusion position by predicting the lane confidence along the vertical and horizontal directions. This makes the model robust in challenging driving scenarios.
The proposed network achieves state-of-the-art performance for the CULane [pan2017spatial] and BDD100K [yu2018bdd100k] datasets and outstanding performance gains under low-light conditions.
2 Related Work
The use of deep learning for lane detection has become increasingly popular. Owing to the success of deep learning in the computer vision field, many studies have applied deep learning techniques to lane detection for advanced driver assistance systems, particularly for autonomous driving [neven2018towards, pan2017spatial, hou2019learning, qin2020ultra, yoo2020end]. This approach performs better than hand-crafted methods [ek2004lane, sun2006hsi, wang2000lane, kim2008robust]. There are two main deep-learning-based approaches: 1) classification-based and 2) segmentation-based approaches.
The first approach considers lane detection a classification task [qin2020ultra, yoo2020end, chougule2018reliable]. Some works [qin2020ultra, yoo2020end] applied row-wise classification for the detection of lanes, thereby excluding unnecessary post-processing. In particular, [qin2020ultra] achieved high-speed performance by lightening the model. However, in the classification method, the performance depends on how many times the position of the lane is subdivided. In addition, it is difficult to determine the shape of the lane accurately.
Another approach to lane detection is to consider it a semantic segmentation task [neven2018towards, pan2017spatial, hou2019learning, hou2020inter, lee2017vpgnet]. Neven [neven2018towards] performs instance segmentation by applying a clustering method to lane marking segmentation. Moreover, Lee [lee2017vpgnet] proposes multi-task learning that simultaneously performs grid regression, object detection, and multi-label classification guided by the vanishing point. Multi-task learning provides additional supervisory signals. However, the additional annotations required for multi-task learning are expensive. Pan [pan2017spatial] applies a message passing mechanism between adjacent pixels. This method overcomes lane occlusion caused by vehicles and obstacles on the road and recognizes lanes in low-light environments. However, this message passing method incurs considerable computational cost. To address the slow speed of the method in [pan2017spatial], Hou [hou2019learning] proposes the Self Attention Distillation (SAD) module and achieves a significant improvement without additional supervision or labeling while maintaining the number of parameters in the model. However, in the SAD module, knowledge distillation is conducted from deep to shallow layers, which only enhances the inter-layer information flow for the lane area and does not provide an additional supervisory signal for occlusion. Our work is similar to [hou2019learning] in that it uses a self-attention module. However, it adopts a completely different self-attention approach. To overcome occlusion problems, the proposed ESA module calculates the confidence of the lane, which is closely related to the occlusion. By using lane confidence, the model can reinforce learning for these areas by providing a new supervisory signal for occlusion.
Self-attention has provided significant improvements in machine translation and natural language processing. Recently, self-attention mechanisms have been used in various computer vision fields. The non-local block [wang2018non] learns the relationship between pixels at different locations. For instance, Zhang [zhang2019self] introduces a better image generator with non-local operations, and Fu [fu2019dual] improves the semantic segmentation performance using two types of non-local blocks. In addition, self-attention can emphasize important spatial information of feature maps. The works [park2018bam, woo2018cbam] showed meaningful performance improvements in classification by adding channel attention and spatial attention mechanisms to the model.
The proposed ESA module operates differently from the previously presented modules. The ESA module extracts the global context of congested roads to predict areas with high lane uncertainty and to emphasize those lanes. The results of various driving scenarios are covered in Section 4.1.
3 Proposed Approach
Unlike general semantic segmentation, lane segmentation must also predict lane areas that are covered by objects. Therefore, lane segmentation tasks must extract global contextual information and consider the relationship between distant pixels. In fact, self-attention modules with non-local operations [wang2018non] can be an appropriate solution. Several works [zhu2019asymmetric, fu2019dual, huang2019ccnet] show that non-local operations are effective in semantic segmentation, where global contextual information is important. However, in contrast to the complex shapes in general semantic segmentation, the lane has a relatively simple geometric shape. This makes non-local operations inefficient.
If the network can extract occluded locations, lanes that are invisible owing to occlusions are easier to segment. The location information of occlusions becomes more important than their shape owing to the simple lane shape. Therefore, rather than extracting the high-level occlusion shape, it is more effective to extract the low-level occlusion position. By using this positional information, the ESA module can extract the column or row-wise confidence of lanes by itself. The confidence indicates that the model knows the location of the occlusion based on the global contextual information of the scene.
3.2 Expanded Self Attention
The ESA module aims to extract global contextual information by recognizing the occluded area. The structure of the ESA module is inspired by the fact that the lane is a line that spreads from the vanishing point. Due to the simple shape of the lane, it is efficient to predict the confidence along the vertical or horizontal direction of the lane in order to estimate the location of the occlusion. Therefore, we divide the ESA module into HESA and VESA according to the direction to extract the lane confidence.
Figure 2 shows two types of ESA modules, HESA and VESA. Both modules have an ESA encoder consisting of convolution layers and fully connected layers. The ESA encoders of the HESA and VESA modules are defined as $f_{hesa}$ and $f_{vesa}$, respectively. The only difference between the two encoders is the length of the output vector. For the HESA module, the output shape of $f_{hesa}$ is $N \times H \times 1$, where $N$ is the maximum number of lanes and $H$ is the height of the image. This output is expanded horizontally so that it matches the original image size. The expanded matrix is the ESA matrix, $M_{hesa} \in \mathbb{R}^{N \times H \times W}$. It should be noted that each row of $M_{hesa}$ has the same element value, as shown in Figure 2. Similarly, for the VESA module, the output of $f_{vesa}$ of size $N \times 1 \times W$ is vertically expanded so that the ESA matrix is $M_{vesa} \in \mathbb{R}^{N \times H \times W}$, where $W$ is the width of the image. Therefore, as illustrated in Figure 2, each column of $M_{vesa}$ has the same value. The ESA matrix takes values between 0 and 1 owing to the sigmoid layer of the ESA encoder and highlights a part of the predicted probability map via the element-wise product between the predicted probability map and the ESA matrix. If the predicted probability map is $P_{pred}$, the weighted probability map is formulated as $E_{hesa} = P_{pred} \odot M_{hesa}$ for the HESA module and $E_{vesa} = P_{pred} \odot M_{vesa}$ for the VESA module, where the operator $\odot$ denotes an element-wise product.
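The expansion step can be illustrated with a minimal NumPy sketch. It assumes that the HESA encoder emits one confidence per lane and image row (shape N x H x 1) and the VESA encoder one per lane and image column (shape N x 1 x W); the function names, the toy sizes, and the random stand-in inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def expand_hesa(conf_h, width):
    """Repeat an (N, H, 1) confidence tensor along the width axis -> (N, H, W)."""
    return np.repeat(conf_h, width, axis=2)


def expand_vesa(conf_v, height):
    """Repeat an (N, 1, W) confidence tensor along the height axis -> (N, H, W)."""
    return np.repeat(conf_v, height, axis=1)


rng = np.random.default_rng(0)
N, H, W = 4, 6, 8  # toy sizes (assumed): max lanes, image height, image width

# Sigmoid of random values, standing in for the ESA encoder outputs in (0, 1).
conf_h = 1.0 / (1.0 + np.exp(-rng.normal(size=(N, H, 1))))
conf_v = 1.0 / (1.0 + np.exp(-rng.normal(size=(N, 1, W))))

# Random stand-in for the predicted lane probability map.
prob_map = rng.random((N, H, W))

m_hesa = expand_hesa(conf_h, W)  # every row holds a single repeated value
m_vesa = expand_vesa(conf_v, H)  # every column holds a single repeated value

# Element-wise product: the ESA matrix re-weights the probability map.
e_hesa = prob_map * m_hesa
e_vesa = prob_map * m_vesa
```

Because the ESA matrix is constant along one axis, the same result could be obtained by broadcasting `prob_map * conf_h` directly; the explicit `np.repeat` is kept here only to mirror the "expanded matrix" wording in the text.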
The most important role of the ESA module is extracting lane confidence. Figure 4 presents the predicted probability map of the model and the output of the ESA encoder. The colors in the graph match the colors of the lanes. The length of the encoder output equals the height of the image. However, in Figure 4, only the locations in which the lane exists are plotted. If there is no occlusion on the road, as in the first image in Figure 4, the encoder output is high overall. If occlusion occurs, as for the blue and yellow lanes in the second image, the value in the occluded area is small. This is how the ESA module measures the confidence of the lane. If the visual cues for the lane are abundant, the lane confidence at that location increases, and a large weight is output. Conversely, if there are few visual cues, the lane confidence decreases, and a small weight is output.
3.3 Network Architecture
Our network architecture is illustrated in Figure 3. Our neural network starts with the baseline model, which consists of an encoder and a decoder. In this paper, since inference time is an important factor in lane detection, lightweight baseline models such as ResNet-18 [he2016deep], ResNet-34 [he2016deep], and ERFNet [romera2017erfnet] are used. Inspired by the works [hou2019learning, liu2020lane], we add the existence branch to the baseline model. The existence branch is designed for datasets in which lanes are classified according to their relative position, such as TuSimple and CULane. In the case of BDD100K, the existence branch is not used because we consider all lanes as one class. We extract a total of four feature maps from the baseline model encoder. These feature maps are resized and concatenated to form the input to the ESA module. We discuss in detail how the ESA module output, baseline model output, and ground truth labels interact with each other in Section 3.4.
3.4 Objective Functions
Segmentation and existence loss. First, we reduce the difference between the predicted lane segmentation map $S_{pred}$ and the ground truth segmentation map $S_{gt}$. The segmentation loss $L_{seg}$ is defined as follows:
$$L_{seg} = L_{CE}(S_{pred}, S_{gt}), \tag{1}$$
where $L_{CE}$ is the standard cross entropy loss. In addition, the existence loss is used for the TuSimple and CULane datasets because lanes are classified by their relative positions. The existence loss $L_{exist}$ is formulated as follows:
$$L_{exist} = L_{BCE}(l_{pred}, l_{gt}), \tag{2}$$
where $L_{BCE}$ is the binary cross entropy loss, $l_{gt}$ is a lane existence label, and $l_{pred}$ is an output of the lane existence branch.
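As an illustration of the two losses just described, the following NumPy sketch computes a pixel-wise cross entropy over a class-score map and a binary cross entropy over per-lane existence probabilities. The tensor layouts (classes first for the segmentation scores, one probability per lane for the existence branch) and the function names are assumptions made for this example; in practice a framework's built-in loss functions would normally be used.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean pixel-wise cross entropy.

    logits: (C, H, W) unnormalized class scores.
    labels: (H, W) integer class indices in [0, C).
    """
    # Log-softmax over the class axis, shifted for numerical stability.
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # Pick the log-probability of the labeled class at every pixel.
    rows, cols = np.indices(labels.shape)
    picked = log_probs[labels, rows, cols]
    return float(-picked.mean())


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross entropy between probabilities and 0/1 targets."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    losses = targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs)
    return float(-losses.mean())


# Toy example: 5 classes (background + 4 lanes) on a 2 x 3 image.
seg_logits = np.zeros((5, 2, 3))
seg_labels = np.zeros((2, 3), dtype=int)
seg_loss = cross_entropy(seg_logits, seg_labels)

# Toy existence output for 4 lanes and its 0/1 label.
exist_probs = np.array([0.5, 0.5, 0.5, 0.5])
exist_labels = np.array([1.0, 0.0, 1.0, 1.0])
exist_loss = binary_cross_entropy(exist_probs, exist_labels)
```

With uniform scores, the segmentation loss equals the log of the number of classes, and with all existence probabilities at 0.5 the existence loss equals log 2, which gives a quick sanity check of both functions.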
ESA loss. The ESA module aims to predict the confidence of the lane by recognizing occlusion with global contextual information. However, creating an annotation for the location information of the occlusion is time-consuming and expensive, and the consistency of the annotation cannot be guaranteed. Therefore, our module learns the occlusion location without additional annotations by reducing the mean square error between the weighted probability map $E_{pred}$ and the weighted ground truth segmentation map $E_{gt}$. Figure 5(a) presents this process.
The predicted probability map of the lane is $P_{pred} = \Phi(S_{pred})$, where $\Phi$ is the softmax operator. The ESA loss is formulated as follows:
$$L_{esa} = L_{MSE}(E_{pred}, E_{gt}) + \mu \left| \Psi(E_{pred}) - \lambda \Psi(S_{gt}) \right|, \tag{3}$$
where the ESA matrix is $M_{esa}$, the weighted probability map is $E_{pred} = P_{pred} \odot M_{esa}$, the weighted ground truth map is $E_{gt} = S_{gt} \odot M_{esa}$, and $L_{MSE}$ is the mean square error loss. Moreover, the operator $\Psi$ calculates the average of all values of the feature map, and $\mu$ is a regularization coefficient. The coefficient $\lambda$ has an important effect on the performance of the model, and it determines the proportion of the weighted lane area.
The first term on the right-hand side of Equation (3) is visualized in Figure 5(a). In general, the lane probability map is blurred in areas with sparse supervisory signals. As shown in Figure 5(a), if a large weight is given to an accurately predicted region in the probability map, the mean square error is small. Conversely, when a large weight is given to an uncertainly predicted area, the mean square error is large. This is how the confidence of the lane is predicted without additional annotations.
In fact, if only the mean square error loss is used as the ESA loss, the ESA module outputs collapse to all zeros during training, because an all-zero ESA matrix trivially minimizes the mean square error. To solve this problem, a second term is added as a regularizer to the right-hand side of Equation (3). This regularization term keeps the average pixel value of the weighted probability map equal to a certain percentage of the average pixel value of the ground truth map. This ratio is determined by the coefficient $\lambda$, which has a value between 0 and 1.
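The collapse to zero and the effect of the regularizer can be checked numerically. The sketch below is an assumption-laden illustration, not the authors' code: it takes the regularizer to be the absolute difference between the mean of the weighted probability map and a fixed ratio of the mean of the ground truth map, and the names `esa_loss`, `ratio`, and `reg_coef` are hypothetical.

```python
import numpy as np


def esa_loss(prob_map, gt_map, esa_matrix, ratio=0.8, reg_coef=1.0):
    """Return (total, mse, reg) for an assumed form of the ESA loss.

    mse: mean square error between the weighted prediction and weighted label.
    reg: |mean(weighted prediction) - ratio * mean(label)| (assumed form).
    """
    e_pred = prob_map * esa_matrix
    e_gt = gt_map * esa_matrix
    mse = float(np.mean((e_pred - e_gt) ** 2))
    reg = float(abs(e_pred.mean() - ratio * gt_map.mean()))
    return mse + reg_coef * reg, mse, reg


# Toy 4 x 4 scene with one vertical lane in column 1.
gt = np.zeros((1, 4, 4))
gt[0, :, 1] = 1.0

# Prediction: confident in the top two rows, blurry (occluded) in the bottom two.
prob = np.zeros((1, 4, 4))
prob[0, :2, 1] = 0.9
prob[0, 2:, 1] = 0.2

ones = np.ones_like(gt)    # uniform weights everywhere
zeros = np.zeros_like(gt)  # degenerate all-zero ESA matrix
down = np.ones_like(gt)    # HESA-like weights: small on the occluded rows
down[0, 2:, :] = 0.1

_, mse_ones, _ = esa_loss(prob, gt, ones)
_, mse_down, _ = esa_loss(prob, gt, down)
total_zero, mse_zero, reg_zero = esa_loss(prob, gt, zeros)
```

In this toy case the all-zero matrix gives zero mean square error, so only the regularizer penalizes it, while down-weighting the blurry rows lowers the error relative to uniform weights. This matches the intuition in the text that low weights end up on poorly predicted (occluded) rows.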
It should be noted that although a single ESA module is either an HESA or a VESA module, both modules can be attached to the model simultaneously. In that case, the ESA loss is $L_{esa} = L_{hesa} + L_{vesa}$, where $L_{hesa}$ is the ESA loss of the HESA module, and $L_{vesa}$ is the ESA loss of the VESA module.
Finally, the above losses are combined to form the final objective function:
$$L = \alpha L_{seg} + \beta L_{exist} + \gamma L_{esa}. \tag{4}$$
The parameters $\alpha$, $\beta$, and $\gamma$ balance the segmentation loss, existence loss, and ESA loss of the final objective function.
| Category | R-34-HESA | ERFNet-HESA | SCNN [pan2017spatial] | ENet-SAD [hou2019learning] | R-34-Ultra [qin2020ultra] | ERFNet [romera2017erfnet] | ERFNet-E2E [yoo2020end] |
Datasets. We use three popular lane detection datasets for our experiments: TuSimple [tusimple], CULane [pan2017spatial], and BDD100K [yu2018bdd100k]. The TuSimple dataset consists of images of highways with constant illumination and good weather, and it is a relatively simple dataset because the roads are not congested. Therefore, various algorithms [pan2017spatial, neven2018towards, ghafoorian2018gan, hou2019learning, jung2020towards] have been tested on the TuSimple dataset since before 2018. CULane is a very challenging dataset that contains crowded environments with city roads and highways with varying lighting conditions. The BDD100K dataset also consists of images captured under various lighting and weather conditions. In addition, it has the largest number of labeled lanes among the three datasets. However, because the number of lanes is large and inconsistent, we detect lanes without distinguishing lane instances.
1) TuSimple. In accordance with [tusimple], the accuracy is expressed as $\text{Accuracy} = N_{pred} / N_{gt}$, where $N_{pred}$ is the number of correctly predicted lane points and $N_{gt}$ is the number of ground truth lane points. Furthermore, false positives (FP) and false negatives (FN) are included in the evaluation index.
2) CULane. In accordance with the evaluation metric in [pan2017spatial], each lane is considered 30 pixels thick, and the intersection-over-union (IoU) between the ground truth and prediction is calculated. Predictions with IoUs greater than 0.5 are considered true positives (TP). In addition, the F1-measure is used as an evaluation metric and is defined as follows:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$
where $\text{Precision} = \frac{TP}{TP + FP}$ and $\text{Recall} = \frac{TP}{TP + FN}$.
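The CULane-style metric can be sketched in a few lines of Python. This is only an illustration under stated assumptions: lanes are taken to be already rasterized into boolean masks (the 30-pixel-wide drawing step is omitted), and the helper names `mask_iou` and `f1_measure` are hypothetical rather than part of any official evaluation toolkit.

```python
import numpy as np


def mask_iou(pred_mask, gt_mask):
    """Intersection-over-union of two binary lane masks of equal shape."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 0.0  # both masks empty: treat as no overlap
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)


def f1_measure(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative lane counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A predicted lane counts as a true positive when its IoU exceeds 0.5.
pred = [[1, 1, 1, 0]]
gt = [[0, 1, 1, 0]]
is_true_positive = mask_iou(pred, gt) > 0.5

# Example counts (made up for illustration): 8 TP, 2 FP, 2 FN.
score = f1_measure(tp=8, fp=2, fn=2)
```

The guards against empty denominators are a design choice for robustness on images without lanes; how such cases are scored in the official evaluation is not specified in the text above.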
3) BDD100K. In general, since there are more than 8 lanes in an image, following [hou2019learning], we determine the pixel accuracy and IoU of the lane as evaluation metrics.
Implementation details. Following [pan2017spatial, hou2019learning], we resize the images of TuSimple, CULane, and BDD100K to , , and , respectively. The original BDD100K annotations label one lane with two lines. Because this labeling method is difficult to learn, we draw new 8-pixel-thick ground truth labels that pass through the center of the lane. The new ground truth labels are applied equally to both the training and test sets. Moreover, SGD [bottou2010large] is used as the optimizer, and the initial learning rate and batch size are set to 0.1 and 12, respectively. The loss balance coefficients $\alpha$, $\beta$, and $\gamma$ in Equation (4) are set to 1, 0.1, and 50, respectively. The regularization coefficient $\mu$ in Equation (3) is 1. We experimentally verify that the value of the ratio coefficient $\lambda$ in Equation (3) has a significant effect on the performance of the model. The optimal value of $\lambda$ is 0.8 for CULane and BDD100K and 0.9 for TuSimple. The effect of $\lambda$ on the performance is discussed in detail in Section 4.2. Because the BDD100K experiment regards all lanes as one class, the output of the original segmentation branch is replaced with a binary segmentation map. In addition, the lane existence branch is removed for the evaluation. All models are trained and tested with PyTorch on an Nvidia RTX 2080Ti GPU.
Tables 1-3 compare the performance results of the proposed method and previously presented state-of-the-art algorithms for CULane, TuSimple, and BDD100K datasets. The proposed method is evaluated with the baseline models ResNet-18 [he2016deep], ResNet-34 [he2016deep], and ERFNet [romera2017erfnet], and each model is combined with either an HESA or a VESA. Moreover, the use of both HESA and VESA modules is denoted as “H&VESA”. The effects of using both modules simultaneously are presented in Section 4.2.
The combination of the baseline model ERFNet and the ESA module outperforms ERFNet and achieves state-of-the-art performance on CULane and BDD100K. In particular, ERFNet-HESA provides significant performance gains over ERFNet for almost all driving scenarios in the CULane dataset, while the runtime and number of parameters remain unchanged. In addition, ERFNet-HESA surpasses the existing methods in the challenging low-light scenario of the CULane dataset, achieving an F1-measure of 69.2. Its runtime is similar to those of the previous state-of-the-art methods in Table 1. Thus, the proposed method improves accuracy at no additional inference cost. As shown in Table 3, compared to ERFNet, ERFNet-HESA increases accuracy from 55.36% to 57.47% on the BDD100K dataset. In addition, ERFNet-H&VESA achieves the highest accuracy of 60.24%. These results show that the HESA and VESA modules work complementarily. The details are covered in Section 4.2. The results on the TuSimple dataset in Table 2 show the effect of the ESA module, but it does not achieve the highest performance. The TuSimple dataset contains images of highways with bright light and generally less occlusion. Because the ESA module extracts global contextual information by predicting the occluded location, our method is less effective for datasets with less occlusion.
We provide qualitative results of our algorithm for various driving scenarios in three benchmarks. In particular, the first and second rows of Figure 6 show that our method can detect sharp lanes even under extreme lighting conditions and in situations in which the lanes are barely visible owing to other vehicles. Figure 7 (a) shows that the ESA module can connect the lanes occluded by vehicles without interruption. According to Figure 7 (b), the approach achieves more accurate lane detection in low-light environments. Thus, compared to the baseline model, the ESA module can improve performance in challenging driving scenarios with extreme occlusion and lighting conditions.
4.2 Ablation Study
Combination of HESA and VESA. Table 4 summarizes the performance of different combinations of HESA and VESA. The following observations can be made. (1) The HESA and VESA modules perform similarly. (2) In general, H&VESA, in which the HESA and VESA modules are applied simultaneously, performs better. In addition, H&VESA results in a remarkable performance improvement for BDD100K. The HESA and VESA modules lead to similar performance because the direction along which the lane confidence is predicted is not important for extracting the low-level occlusion location, given that the lane has a simple geometric shape. Because the HESA and VESA modules complement each other to extract richer global contextual information, it is not surprising that H&VESA generally achieves the highest performance. The large gain on BDD100K further suggests that global contextual information is more important for this dataset, which includes many lanes.
Value of $\lambda$. Figure 8 compares the total F1-score on the CULane dataset with respect to the ratio coefficient $\lambda$ in Equation (3). As shown in Figure 8, ERFNet-HESA shows the best performance at $\lambda = 0.8$. It is important to find a suitable value of $\lambda$ because it determines the ratio of occluded and normal areas. When $\lambda$ is small (i.e., when the predicted occlusion area is wide), the sensitivity to occlusion decreases, which makes it difficult to determine the occluded location accurately. Conversely, when $\lambda$ is large, the detected occlusion area becomes narrow, which makes it difficult for the network to reinforce learning for the entire occluded area.
This paper proposes the ESA module, a novel self-attention module for robust lane detection in occluded and low-light environments. The ESA module extracts global contextual information by predicting the confidence of the lane. The proposed module can be applied to any encoder-decoder-based model and does not increase the inference time. The performance of the model is evaluated on datasets containing a variety of challenging driving scenarios. According to the results, our method outperforms previous methods on the CULane and BDD100K datasets. We confirm the effectiveness of the ESA module in various comparative experiments and demonstrate that our method is robust in challenging driving scenarios.