LDNet: End-to-End Lane Detection Approach using a Dynamic Vision Sensor

09/17/2020 · by Farzeen Munir, et al.

Modern vehicles are equipped with various driver-assistance systems, including automatic lane keeping, which prevents unintended lane departures. Traditional lane detection methods incorporate handcrafted or deep learning-based features followed by postprocessing techniques for lane extraction using RGB cameras. The utilization of an RGB camera for lane detection is prone to illumination variations, sun glare, and motion blur, which limits the performance of the lane detection method. Incorporating an event camera into the perception stack of autonomous driving is one of the most promising solutions for mitigating the challenges encountered by RGB cameras. In this work, LDNet, a lane detection network using a dynamic vision sensor, is proposed. It is designed in an encoder-decoder manner, with an atrous spatial pyramid pooling block followed by an attention-guided decoder that predicts lanes and reduces false predictions; this decoder eliminates the need for a postprocessing step. The experimental results show significant improvements of 5.54% and 5.03% on the F1 scores in the multiclass and binary class lane detection tasks, respectively. Additionally, the IoU scores of the proposed method surpass those of the best-performing state-of-the-art method by 6.50% and 9.37% in the multiclass and binary class tasks, respectively.


I Introduction

Advancements in the development of sensor technology have made a tremendous impact on autonomous driving in terms of environmental perception [1]. The primary goal is to understand the environment surrounding the autonomous vehicle through the fusion of exteroceptive and proprioceptive sensor modalities [2]. The perception of the surrounding environment includes many challenging tasks, for instance, lane extraction, object detection, and traffic mark recognition, which provide the foundation for the safety of autonomous vehicles as standardized by the Safety of the Intended Functionality (SOTIF, ISO/PAS 21448) (https://www.daimler.com/innovation/case/autonomous/safety-first-for-automated-driving-2.htm). The fundamental task in the hierarchy of perception is the extraction of lane information, as it assists an autonomous vehicle in precisely determining its position between the lanes. Accurate lane extraction forms the basis for robust planning of autonomous vehicles, which includes lane departure and trajectory planning.

In the literature, much promising research on lane detection with conventional RGB cameras has been proposed, based either on handcrafted features or on end-to-end deep neural networks [3] [4] [5] [6]. Conventional cameras have certain limitations, and their utilization can have an adverse effect on perception tasks [7]. For instance, with conventional RGB cameras, changes in illumination conditions can degrade the performance of a lane detection algorithm because the scene in the input becomes unclear. The development of event cameras provides a promising solution to overcome this uncertainty in conventional RGB cameras. Event cameras (or dynamic vision sensors) are novel sensors used in different perception tasks. Event cameras have two primary characteristics: i) low latency and ii) a high dynamic range. An event camera captures the environment through changes in events, and its low latency helps generate images faster than conventional cameras [7]. Additionally, this characteristic ensures that image quality is not affected by motion blur. The high dynamic range of event cameras addresses the effect of illumination: compared to the limited dynamic range of conventional cameras, event cameras provide a much higher dynamic range, which mitigates the illumination variation problem that appears in conventional cameras for lane detection [7] [8]. Fig. 1 illustrates the difference between event cameras and standard conventional cameras.

Fig. 1: A sequence of images captured while coming out of a tunnel (T1-T2-T3-T4-T5). The top row shows the grayscale images, and the bottom row shows the corresponding event camera images. RGB cameras are highly affected by illumination variations due to their low dynamic range. The figure is borrowed from [9] to illustrate the difference between event cameras and standard RGB cameras.

In this work, inspired by the utilization of event cameras in autonomous driving for lane detection tasks, as illustrated in [9], an encoder-decoder neural architecture is designed for lane detection using an event camera. The architecture of the network is composed of three core blocks: i) an encoder, ii) an atrous spatial pyramid pooling (ASPP) block [10] [11], and iii) an attention-guided decoder. The encoder of the proposed network is a combination of convolutional layers followed by a DropBlock layer and a max-pooling layer for the fast encoding of the input data. The ASPP block processes the encoded feature maps to extract long-range features and ameliorate the spatial loss. For the decoder, a novel attention-guided decoder is proposed, followed by fully connected layers to produce the lane detection predictions. The proposed method is extensively tested on the event camera dataset (DET) for multiclass and binary class lane detection tasks and evaluated using the F-measure (F1 score) and intersection over union (IoU) metrics. The proposed method achieves significant improvements of 5.54% and 5.03% on the F1 scores in the multiclass and binary class tasks, respectively, surpassing the best-performing state-of-the-art method. In the case of the IoU scores, the proposed method surpasses the best-performing state-of-the-art method by 6.50% and 9.37% in the multiclass and binary class tasks, respectively.

In summary, the following are the main contributions of this work:

  1. A novel encoder-decoder architecture is designed for lane detection using an event camera that includes an ASPP block and an attention-guided decoder to make predictions.

  2. The proposed work achieved F1 score improvements of 5.54% and 5.03% in the multiclass and binary class tasks, respectively, in comparison to the best-performing state-of-the-art method on the test set of the event camera dataset.

  3. The proposed architecture surpasses the best-performing benchmark state-of-the-art results by 6.50% and 9.37% on the IoU scores in the multiclass and binary class tasks, respectively.

The remainder of the paper is organized as follows: Section II explains the related work. Section III discusses the proposed methodology. Section IV focuses on the experiments and results. The experimental analysis is discussed in Section V, and finally, Section VI concludes the paper.

II Related Work

In autonomous driving, lane detection serves as a fundamental component, and much research has been focused on the development of robust lane detection algorithms [12]. In the literature, two mainstream types of techniques have been used for lane detection: traditional image processing methods and deep learning-based segmentation methods [13] [14] [15].

Fig. 2: The overall architecture of LDNet, which includes three core components: i) an encoder, ii) an ASPP block and iii) an attention-guided decoder.

Traditional vision-based lane detection methods follow pipelines that include image preprocessing, feature extraction, lane model fitting and lane tracking. In traditional approaches, image preprocessing is a necessary step that determines the quality of features for lane detection tasks. For this purpose, image preprocessing includes region of interest (ROI) generation, image enhancement for extracting lane information and removal of non-lane information. The extraction of ROIs is an efficient method for reducing redundant information by selecting the lower part of the image [16] [17] [18], and in some works, ROIs are generated using vanishing point detection techniques [26] [20] [21]. Inverse perspective mapping (IPM) [22] [23] or warp perspective mapping [24] is used after ROI generation based on the parallel line assumption to reduce the effect of noise and to conveniently extract lanes. Lane enhancement is performed by using either color-based techniques or edge detection methods, such as hue-saturation-intensity (HSI) [25], YCbCr [26], and LAB [27] as color-based models for transformation, and the Sobel operator [28] [29] and Canny detector [30] [31] as edge-based techniques. Hybrid methods combining color and edge cues are also used [18]. ROI generation reduces the noise in images, but it is not robust to shadows and vehicles. Filters are used in some works to eliminate non-lane information [32] [33] [16]. In traditional approaches, lanes can also be modeled in the form of lines [34] [32], parabolas [35] [29], splines [32] [14] [36], hyperbolas [21], and so on. Additionally, tracking is used as a postprocessing step to overcome illumination variations. Kalman filtering and particle filtering are the most widely used approaches for lane tracking [34] [24]. In addition to tracking, Markov and conditional random fields have also been utilized as a postprocessing approach for lane detection [37]. [38] used a normal map for lane detection, utilizing depth information to generate normal maps and adaptive threshold segmentation for lane extraction.

Recent advances in neural network architectures have had a tremendous impact on refining the extracted features for lane detection tasks. The manual tuning required by traditional methods for ROI generation, filtering and tracking is avoided by the use of neural networks, which formalize the lane detection problem as a semantic segmentation task. The vanishing-point-guided network (VPGNet) is guided by vanishing points for road and lane marking detection [39]. [40] proposed LaneNet, which performs detection in two stages: i) lane edge proposal generation and ii) lane localization. PolyLaneNet uses a front-facing camera for lane detection by generating the polynomials for each lane in the image via deep polynomial regression [41]. In [42], the authors formulated lane detection as a row-based selection problem using global features; row-based selection reduces the computational cost of lane detection. Moreover, the self-attention distillation (SAD) approach has also been used in lane detection, allowing a model to learn from itself without any additional labels [43]. [44] used two cascaded neural networks in an end-to-end lane detection system.

The most common datasets used by traditional and deep learning-based methods include the Caltech dataset [13], TuSimple dataset [45], and CULane dataset [46]. These datasets are based on RGB images generated by conventional cameras, and illumination changes and motion blur in the images affect the performance of lane detection algorithms. Event cameras are a type of novel sensor that address these problems of standard cameras through their high dynamic range and low latency. In the literature, several event camera datasets have been published, including the Synthesized Dataset [8], Classification Dataset [47], Recognition Dataset [48], and Driving Dataset [49]. The aforementioned event camera datasets are for general purposes, and none of them are explicitly for the lane detection task. Additionally, these datasets have low spatial resolutions. Two notable event camera applications that have been published in the literature are steering angle prediction [50] and car detection [51]. [9] proposed an event-camera dataset for lane detection tasks. The authors evaluated their dataset with different lane detection algorithms, including DeepLabv3 [52], a fully convolutional network (FCN) [53], RefineNet [54], LaneNet [40] and a spatial convolutional neural network (SCNN) [46], and published a benchmark for lane detection with event cameras. In their lane detection benchmark, the SCNN [46] outperformed all the other algorithms and achieved the best mean F1 and mean IoU scores. Inspired by [9], in this work we use their dataset for the lane detection task, and the experimental evaluation shows that the proposed method surpasses the abovementioned benchmark in terms of both the F1 and IoU scores.

III Methods

In this section, we describe in detail the proposed framework for lane marking detection, as illustrated in Fig. 2. The framework consists of three modules: an encoder module, which extracts features from the input image; a core module consisting of an ASPP block that extracts long-range features; and an attention-guided decoder module. In addition, skip connections are added from the encoder to the decoder to retain high-frequency spatial features.
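For concreteness, the sketch below shows how these three modules could be wired together in PyTorch, including the skip connections from the encoder to the decoder. The class and module names (LDNetSketch, Encoder, ASPPBlock, AttentionDecoder) are illustrative placeholders, not the authors' implementation; the internals of the individual modules are sketched in the following subsections.

```python
# A minimal structural sketch of the three-module pipeline described above.
# Encoder, ASPPBlock, and AttentionDecoder are hypothetical module names; their
# internals are only sketched in the subsections that follow.
import torch.nn as nn

class LDNetSketch(nn.Module):
    def __init__(self, encoder, aspp, decoder):
        super().__init__()
        self.encoder = encoder   # feature extraction from the event image
        self.aspp = aspp         # long-range (atrous) context aggregation
        self.decoder = decoder   # attention-guided upsampling path

    def forward(self, x):
        # The encoder is assumed to return the deepest feature map together with
        # the intermediate maps used as skip connections.
        deep_features, skips = self.encoder(x)
        context = self.aspp(deep_features)
        # Each skip connection is gated by an attention module before fusion.
        return self.decoder(context, skips)
```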

Fig. 3: Atrous spatial pyramid pooling with a fixed kernel size and different atrous rates. Increasing the atrous rate increases the receptive field of view, enabling object encoding at multiple scales.

III-A Encoder

CNNs outperform traditional techniques that incorporate handcrafted features for lane marking detection using standard RGB cameras [16] [17] [18]. However, lane marking detection with event cameras is a new research domain, and many state-of-the-art deep learning-based lane detection algorithms, such as SCNN [46], LaneNet [40], and FCN [53], have been applied to event camera images but require further improvement in robustness [9]. The proposed encoder is shown in Fig. 2. The encoder is designed with four operational blocks, and each block consists of a convolution stack, a DropBlock layer and a max-pooling layer. However, the last operational block does not include a max-pooling layer, to match the filter size of the decoder.

The convolution stack of the encoder architecture is adopted from the VGG architecture [55]. The convolutional layer parameters in terms of the receptive filter size and stride are kept the same as those of the VGG architecture: 3×3 and 1, respectively. To increase the detailed representation of low-level feature encoding, the convolution stack consists of two convolutional layers followed by batch normalization. A nonlinear activation function is employed after the second convolutional layer, which makes the decision function discriminative. Let x be the higher-dimensional image representation extracted from the convolutional layers by progressively processing local features layer by layer. This process categorizes pixels in a higher-dimensional space corresponding to their semantics. However, the model predictions are conditioned on the features extracted from the receptive field. For each convolutional layer l, a feature map x^l is obtained by sequentially applying a linear transformation followed by a nonlinear activation function. The rectified linear unit (ReLU) function is chosen as the nonlinear activation function, given by Eq. (1):

σ(x_{i,c}) = max(0, x_{i,c})     (1)

where c represents the channel dimension and i denotes the spatial dimension. The feature map activation is formulated as

x_c^l = σ( Σ_{c' ∈ F_{l−1}} x_{c'}^{l−1} * k_{c',c} )     (2)

where * represents the convolution operation, F_{l−1} is the number of feature maps in layer l−1, and k_{c',c} is the convolution kernel. The spatial subscript i is ignored for notational clarity in the equation. The function σ is applied to the output of convolutional layer l, where k is a trainable kernel parameter. These parameters are learned by minimizing the objective function during training.
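As a concrete illustration of one operational block, a minimal PyTorch sketch is given below, assuming VGG-style 3×3 convolutions with stride 1; the channel widths and the placement of batch normalization are assumptions, and the DropBlock layer is the one sketched with Algorithm 1 below.

```python
# A minimal sketch of one encoder operational block (not the authors' code):
# a two-convolution stack with batch normalization, a ReLU after the second
# convolution, a DropBlock layer, and a max-pooling layer.
import torch.nn as nn

def encoder_block(in_ch, out_ch, drop_block):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),          # BN placement is an assumption
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),           # nonlinearity after the second conv
        drop_block,                      # structured regularization (Algorithm 1)
        nn.MaxPool2d(kernel_size=2),     # omitted in the last operational block
    )
```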

The DropBlock layer is introduced after each convolution stack in the operational block, inspired by [56]. It is a structured form of dropout that is particularly effective in regularizing CNNs. The notable difference between DropBlock and dropout is that DropBlock drops contiguous regions from a feature map rather than random independent units. The pseudocode of DropBlock is given in Algorithm 1. block_size and γ are the two main tuning parameters: block_size represents the size of the block to be dropped, while γ is a control parameter for the number of activation units to be dropped. Like dropout, DropBlock is not applied during evaluation. A max-pooling layer is incorporated in each operational block to reduce the size of the feature map.

Input: feature map A obtained from convolutional layer l, block_size, γ, mode
if mode == Inference then
       return A
end if
Randomly generate mask M: M_{i,j} ~ Bernoulli(γ)
For each zero position M_{i,j}, a spatial square mask is created with size equal to block_size and center at M_{i,j}
Set all the values inside the spatial square mask equal to zero
Apply the mask: A = A × M
Normalize the feature map: A = A × count(M) / count_ones(M)
Algorithm 1 DropBlock Layer
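A compact PyTorch sketch of a DropBlock layer is given below. It follows the spirit of Algorithm 1 and [56] but is a simplified illustration (fixed γ, odd block_size assumed), not the authors' implementation.

```python
# Simplified DropBlock sketch: seed positions are sampled with probability
# gamma, expanded to block_size x block_size squares, zeroed out, and the
# surviving activations are renormalized. Disabled at evaluation time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropBlock2d(nn.Module):
    def __init__(self, block_size=7, gamma=0.05):   # illustrative defaults
        super().__init__()
        self.block_size = block_size                 # assumed odd so padding keeps size
        self.gamma = gamma

    def forward(self, a):
        if not self.training:                        # no dropping during evaluation
            return a
        # Sample block centers (1 = seed a block to drop).
        seeds = (torch.rand_like(a) < self.gamma).float()
        # Grow each seed into a block_size x block_size square via max pooling.
        dropped = F.max_pool2d(seeds, kernel_size=self.block_size,
                               stride=1, padding=self.block_size // 2)
        mask = 1.0 - dropped                         # 0 inside dropped blocks
        # Apply the mask and renormalize the remaining activations.
        return a * mask * mask.numel() / mask.sum().clamp(min=1.0)
```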
Fig. 4: The attention module used in the attention-guided decoder is illustrated.

III-B Atrous spatial pyramid pooling (ASPP) block

In CNNs, the repeated use of max pooling and strided convolutions in successive layers reduces the feature map resolution, resulting in the loss of spatial information. One possible way to reduce this spatial loss is the addition of deconvolutional layers [57] [58], but this is computationally intensive. The notion of atrous convolution was introduced by [10] [11] to overcome the spatial loss problem. The dilated convolution operation increases the receptive field without increasing the number of training parameters or reducing the feature map resolution. It allows the network to learn higher-dimensional features across the entire image for refining full-resolution detections. The atrous convolution operation can be applied to one-dimensional or two-dimensional input data. Considering one-dimensional input data first, the output y[i] of an atrous convolution of an input signal x with a kernel filter w of length K is given by Eq. (3):

y[i] = Σ_{k=1}^{K} x[i + r·k] w[k]     (3)

where r is the rate parameter that corresponds to the stride with which the input signal is sampled. Fig. 3 shows the concept of an atrous convolution in two dimensions. The standard convolution is an atrous convolution with a rate of r = 1. Increasing the rate parameter increases the receptive field of the feature map at any convolutional layer without increasing the computational cost or the number of parameters. It introduces zeros between consecutive filter values, effectively enlarging the kernel size of the filter without increasing the number of parameters or the computational complexity. Therefore, it offers an effective mechanism to control the receptive field of view and find the best compromise between the localization of an object of interest and context assimilation. In this work, the core module consists of an ASPP block. The feature map obtained from the encoder module is convolved with the ASPP block, which consists of six atrous convolution layers with increasing rates. The outputs of the layers are summed and passed to the attention-guided decoder block.
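A minimal PyTorch sketch of such an ASPP block is shown below; the six atrous rates (here 1 through 6) and the channel handling are illustrative assumptions, since the exact rates used in LDNet are not reproduced here.

```python
# A minimal ASPP sketch: six parallel 3x3 atrous convolutions whose outputs
# are summed, as described in the text. Rates 1..6 are an assumption.
import torch
import torch.nn as nn

class ASPPBlock(nn.Module):
    def __init__(self, channels, rates=(1, 2, 3, 4, 5, 6)):
        super().__init__()
        # padding=rate keeps the spatial size unchanged for a 3x3 kernel.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        # Sum the multi-rate responses to aggregate context at several scales.
        return torch.stack([branch(x) for branch in self.branches]).sum(dim=0)
```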

III-C Attention-guided decoder

Semantic contextual information is captured efficiently by acquiring a large receptive field, and for this purpose, the feature map is gradually downsampled in a typical CNN. The features on the coarse spatial grid model object location and the relationships among features at the global level. However, reducing false-positive predictions for small objects with large shape variability remains challenging. In this work, we propose a novel attention-guided decoder that progressively suppresses the feature responses in unrelated background regions without explicitly extracting an ROI.

The attention coefficient α in the attention-guided decoder distinguishes prominent image regions and prunes features from task-specific activations. The output of the attention module is the elementwise multiplication of the attention coefficients and the input feature maps, given by Eq. (4):

x̂_i^l = α_i^l · x_i^l     (4)

For each pixel vector x_i^l, a single scalar attention value α_i^l is calculated. [59] proposed multidimensional attention coefficients to learn sentence embeddings. Since lane marking detection is a multiclass problem, we utilize multidimensional attention coefficients to learn the semantic context in the image. Fig. 4 shows the attention module. The gating vector g_i determines the focus region for each pixel i. Eq. (5) shows the additive attention formulation:

α_i^l = σ_2( ψ^T σ_1( W_x^T x_i^l + W_g^T g_i + b_g ) + b_ψ )     (5)

where σ_2 represents the sigmoid activation function, and Θ_att characterizes a set of parameters including the linear transformations W_x, W_g and the bias terms b_g, b_ψ. The channelwise 1×1 convolutions compute the linear transformations for the input tensors; this is called "vector concatenation-based attention" and involves concatenating the features x_i^l and g_i and linearly mapping them to a multidimensional space [60]. There are three operational blocks in the attention-guided decoder. Each block consists of a convolutional stack similar to that of the encoder, an upsampling layer to increase the feature map size, and the attention module. The attention module highlights the salient features that are carried through the skip connections, as shown in Fig. 2. The features obtained at the coarse scale are used in gating to remove irrelevant and noisy skip connections. This gating is performed before the concatenation operation so that only relevant activations are added, as seen in Fig. 4. A fully connected layer is added at the end of the decoder module, which classifies each pixel in the feature map; the prediction is then compared with the corresponding ground truth to calculate the loss during training.
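A minimal PyTorch sketch of one attention gate following Eqs. (4) and (5) is given below. The intermediate channel width, the bilinear upsampling of the gating signal, and the single-channel attention coefficient are simplifying assumptions (the text above uses multidimensional coefficients); this is an illustration in the spirit of Fig. 4, not the authors' exact module.

```python
# Additive attention gate sketch: 1x1 (channelwise) convolutions compute the
# linear transformations W_x and W_g, a ReLU and a sigmoid produce the attention
# coefficient alpha (Eq. 5), and the skip features are gated by alpha (Eq. 4).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)  # W_x
        self.w_g = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)  # W_g
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)        # psi (scalar coefficient)

    def forward(self, x_skip, g):
        # Resize the coarse gating features to the skip-connection resolution.
        g = F.interpolate(g, size=x_skip.shape[2:], mode="bilinear",
                          align_corners=False)
        alpha = torch.sigmoid(self.psi(F.relu(self.w_x(x_skip) + self.w_g(g))))
        # Gate the skip features before they are concatenated in the decoder.
        return x_skip * alpha
```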

Fig. 5: The qualitative comparison between different lane detection methods using multiclass labels. (a) shows the input image. (b) shows the ground-truth labels. (c-h) show the results for FCN, DeepLabv3, RefineNet, SCNN, LaneNet and LDNet (ours), respectively.

IV Experiment and Results

The effectiveness of the proposed method for lane marking detection in event camera-based images (DET dataset [9]) is evaluated using multiclass and binary class labels. The results are compared with the state-of-the-art algorithm benchmark on the DET dataset. The proposed method is evaluated in terms of the mean F1 score and the mean IoU. The details are described below.

IV-A DET Dataset

In our experiments, we use the benchmark developed by [9], which published a high-resolution dynamic vision sensor dataset for lane detection. The dynamic vision sensor is a type of event-based sensor that responds to local variations in brightness: it does not follow the principle of standard RGB cameras; instead, the individual pixels in the sensor operate independently and asynchronously, recording variations in brightness. The DET dataset is collected using a CeleX-V device with a resolution of 1280×800 mounted on a car. The dataset is recorded at different times of day and comprises various traffic scenes, such as urban roads, tunnels, bridges, and overpasses. The dataset also includes various lane types, such as parallel dashed lines, single lines, and single dashed lines. The DET dataset provides both binary and multiclass labels. In the multiclass case, the labels are categorized into five classes, where four labels correspond to different lane types and the remaining label is for the background. In this work, we use both types of labels to evaluate the proposed method. For the experimental evaluation, the dataset is split into training, validation and test data at percentages of 50%, 16%, and 33%, respectively.

IV-B Training details

The proposed network is trained in an end-to-end manner with a fixed input image size, chosen to fit memory constraints. In our experimental evaluation, no prior filter is applied to the input images as a preprocessing step to remove noise. The proposed method is evaluated with both label categories, i.e., multiclass labels and binary class labels; in both cases, the training parameters of the network are kept the same. The cross-entropy function is used as the loss function along with the Adam optimizer, with fixed values for the initial learning rate, epsilon, and weight decay. The poly learning rate policy is adopted to update the learning rate, which decreases after each epoch according to Eq. (6). The training process runs for a fixed number of epochs with a fixed batch size, using the PyTorch deep learning library on an Nvidia RTX 2060 GPU.

lr = lr_init × (1 − epoch / max_epoch)^power     (6)

where lr_init is the initial learning rate, max_epoch is the total number of training epochs, and the exponent power is kept fixed while training the network. In addition, the DropBlock parameters, i.e., block_size and γ, are also fixed when training the proposed network. The value of block_size is kept fixed, whereas the value of γ is determined by Eq. (7) to control the fraction of features dropped during training:

γ = ((1 − keep_prob) / block_size²) × (feat_size² / (feat_size − block_size + 1)²)     (7)

where keep_prob defines the probability of keeping a unit and is linearly increased over the course of training, and feat_size denotes the feature map size. These hyperparameter tuning values are inspired by [56].
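For illustration, the sketch below implements the poly learning-rate policy of Eq. (6) and the γ computation of Eq. (7). The base learning rate, power, keep_prob schedule endpoints, block_size, and feature-map size are illustrative assumptions, not the paper's exact settings.

```python
# Poly LR (Eq. 6) and DropBlock gamma (Eq. 7) as plain Python functions.
def poly_lr(base_lr, epoch, max_epochs, power=0.9):
    # Learning rate decays polynomially after each epoch.
    return base_lr * (1.0 - epoch / max_epochs) ** power

def dropblock_gamma(keep_prob, block_size, feat_size):
    # gamma controls how many block seeds are dropped so that roughly
    # (1 - keep_prob) of the feature-map units end up dropped.
    return ((1.0 - keep_prob) / block_size ** 2) * \
           (feat_size ** 2 / (feat_size - block_size + 1) ** 2)

# Example schedule: keep_prob varied linearly over training (illustrative values).
max_epochs = 5
for epoch in range(max_epochs):
    keep_prob = 0.90 + 0.10 * epoch / (max_epochs - 1)
    lr = poly_lr(1e-3, epoch, max_epochs)
    gamma = dropblock_gamma(keep_prob, block_size=7, feat_size=32)
    print(f"epoch {epoch}: lr={lr:.2e}, keep_prob={keep_prob:.2f}, gamma={gamma:.4f}")
```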

IV-C Evaluation metrics

In the literature, several evaluation metrics exist to assess the performance of algorithms. For semantic segmentation and instance segmentation, the F1 score and mean IoU are preferred. In this work, we also use the F1 score and IoU to evaluate the proposed method. The F1 score is given by Eq. (8):

F1 = 2 × (Precision × Recall) / (Precision + Recall)     (8)

where

Precision = TP / (TP + FP)     (9)
Recall = TP / (TP + FN)     (10)

and TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively. The IoU is given by Eq. (11):

IoU = N(P ∩ G) / N(P ∪ G)     (11)

where P represents the predicted lane detection output and G denotes the ground-truth labels; ∩, ∪, and N(·) represent intersection, union, and the number of pixels, respectively. We evaluated the F1 and IoU scores for both multiclass and binary class lane detection.
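As a concrete example, the sketch below computes the F1 score (Eqs. 8-10) and the IoU (Eq. 11) for a single class from boolean prediction and ground-truth masks; the mean scores reported in the tables are assumed to be averages of such per-class, per-image values.

```python
# Per-class F1 and IoU from boolean masks (a small epsilon avoids division by zero).
import numpy as np

def f1_and_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9):
    tp = np.logical_and(pred, gt).sum()             # true positives
    fp = np.logical_and(pred, ~gt).sum()            # false positives
    fn = np.logical_and(~pred, gt).sum()            # false negatives
    precision = tp / (tp + fp + eps)                # Eq. (9)
    recall = tp / (tp + fn + eps)                   # Eq. (10)
    f1 = 2 * precision * recall / (precision + recall + eps)   # Eq. (8)
    iou = tp / (np.logical_or(pred, gt).sum() + eps)            # Eq. (11)
    return f1, iou

# Usage with tiny example masks.
pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
gt   = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
print(f1_and_iou(pred, gt))   # ~ (0.667, 0.5)
```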

Model Mean F1 (%) Mean IoU (%)
FCN [9] 60.39 47.36
DeepLabv3 [9] 59.76 47.30
RefineNet [9] 63.52 50.29
LaneNet [9] 69.79 53.59
SCNN [9] 70.04 56.29
LDNet-multiclass (ours) 75.58 62.79
TABLE I: Comparison of evaluation results of LDNet with other state-of-the-art methods on the DET dataset. The mean F1 scores (%) and mean IoUs (%) are used as evaluation metrics for the multiclass labels. The values in bold are the best scores.
Model Mean F1 (%) Mean IoU (%)
FCN 72.65 58.51
DeepLabv3 71.93 58.45
RefineNet 75.78 61.44
LaneNet 79.21 64.74
SCNN 80.15 67.34
LDNet-binary-class (ours) 85.18 76.71
TABLE II: Comparison of the evaluation results of LDNet with other state-of-the-art methods on the DET dataset. The mean F1 scores (%) and mean IoUs (%) are used as evaluation metrics for the binary class labels. The values in bold are the best scores.

IV-D Results

The DET dataset is benchmarked on typical lane detection algorithms, including the FCN [53], RefineNet [54], SCNN [46], DeepLabv3 [52] and LaneNet [40] algorithms. The FCN algorithm is one of the earliest works to perform semantic segmentation by classifying every pixel in an image; an end-to-end FCN is trained to predict the segmentation map. DeepLabv3 investigates ASPP by upsampling a feature map to extract dense and long-range features. RefineNet explores a multipath refinement network that extracts features along the downsampling process to allow high-resolution predictions using long residual connections. In contrast, LaneNet and SCNN were specifically designed for lane detection tasks. SCNN achieves state-of-the-art accuracy on the TuSimple dataset [45]; it uses slice-by-slice convolutions within feature maps to enable message passing between pixels across rows and columns. LaneNet applies a learned perspective transformation trained on the images; for each predicted lane, a third-degree polynomial is fitted, and the lanes are reprojected onto the images. The aforementioned methods are considered baseline methods and compared with the proposed network. Tables I and II compare the proposed method with the baseline methods for the multiclass and binary class tasks, respectively. LaneNet and SCNN outperform typical semantic segmentation algorithms such as FCN, DeepLabv3 and RefineNet. However, LDNet (the proposed method) outperforms the best-performing state-of-the-art SCNN with improvements of 5.54% on the F1 score and 6.50% on the IoU for multiclass lane detection, and 5.03% on the F1 score and 9.37% on the IoU for binary class lane detection. This comparison provides insight into how the use of the ASPP module with an attention-guided decoder improves the detection of lane markings. It should be noted that no postprocessing step is utilized in our framework. Fig. 5 shows the qualitative results of the proposed algorithm and the baseline methods for multiclass lane detection.

Fig. 6: The visualization of feature activation with and without the attention module in the decoder. The input image and the corresponding labels are also shown.

V Experimental Analysis

In this section, we investigate the effect of the different factors (using a backbone network before the encoder network, the addition of the DropBlock layer and the attention-guided decoder) on the performance of the proposed method.

We experiment with the proposed network using a deeper encoder by utilizing six different backbone networks: VGG16 [55], ResNet-18 [61], ResNet-50 [61], MobileNetV2 [62], ShuffleNet [63] and DenseNet [64]. The image is fed to the backbone network, and the resulting feature map is given to the proposed encoder. Pretrained weights are used for the backbone networks. Table III shows the results when using the deeper encoder in the proposed network. The evaluation results show no significant gain from incorporating a backbone network compared to the proposed network. This finding justifies the use of a shallow encoder in LDNet.

Table IV shows the evaluation of the proposed network with DropBlock, spatial dropout, and no dropout. The dropout layer is added to the network to regularize the network and to prevent overfitting. The addition of the DropBlock shows improved results on the test dataset compared to no dropout or spatial dropout. The contiguous regions in the feature map are highly correlated; dropping random units still allows information flow but is not efficient in regularizing the network. The DropBlock helps the network retain semantic information required for lane marking detection.

Fig. 6 shows the visualization of the feature activations, comparing an attention-guided decoder with a plain convolutional decoder. The attention-guided decoder shows improved localization of features, which eliminates the need for external localization of the features and postprocessing steps.

Model Multiclass Binary Class
Mean F1 (%) Mean IoU (%) Mean F1 (%) Mean IoU (%)
LDNet 75.80 62.79 85.18 76.71
LDNet-VGG16 74.42 61.16 83.75 74.98
LDNet-ResNet-18 73.92 60.56 84.127 75.62
LDNet-ResNet-50 74.48 60.20 84.71 76.124
LDNet-MobileNetv2 74.15 60.79 84.03 75.30
LDNet-DenseNet 74.90 61.69 84.11 75.62
LDNet-ShuffleNet 72.72 59.17 83.52 74.70
TABLE III: Quantitative analysis of LDNet with different backbone networks. The experimental analysis is performed on both the multiclass and binary class tasks, and the results are evaluated in terms of the F1 score and IoU.
Model Multiclass Binary Class
Mean F1 (%) Mean IoU (%) Mean F1 (%) Mean IoU (%)
LDNet-no dropout 74.36 61.09 84.17 75.70
LDNet-dropout2d 72.47 58.94 83.25 72.80
LDNet-DropBlock 75.80 62.79 85.18 76.71
TABLE IV: Quantitative analysis illustrating the effect of dropout and DropBlock on LDNet. The evaluation is done for both multiclass and binary class labels, and the F1 score and IoU are reported for each case and label.

VI Conclusion

In this paper, we proposed LDNet, a novel encoder-decoder architecture for lane marking detection in event camera images. LDNet extracts higher-dimensional features from an image, refining full-resolution detections. We introduced an ASPP block as the core of the network, which increases the receptive field of the feature map without increasing the number of training parameters. The use of an attention-guided decoder improves the localization of features in the feature map, removing the need for a postprocessing step. The proposed network, LDNet, is evaluated on an event camera benchmark and outperforms the best-performing state-of-the-art methods in terms of the F1 score and IoU, achieving F1 scores of 75.58% and 85.18% and IoUs of 62.79% and 76.71% for the multiclass and binary class tasks, respectively.

Acknowledgments

This work was partly supported by the ICT R&D program of MSIP/IITP (2014-0-00077, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis), GIST Autonomous Vehicle project, Ministry of Culture, Sports and Tourism (MCST), and Korea Creative Content Agency (KOCCA) in the Culture Technology (CT) Research & Development (R2020070004) Program 2020.

References

  • [1] Bengler, Klaus, Klaus Dietmayer, Berthold Farber, Markus Maurer, Christoph Stiller, and Hermann Winner. ”Three decades of driver assistance systems: Review and future perspectives.” IEEE Intelligent transportation systems magazine 6, no. 4 (2014): 6-22.
  • [2] Munir, Farzeen, Shoaib Azam, Muhammad Ishfaq Hussain, Ahmed Muqeem Sheri, and Moongu Jeon. ”Autonomous vehicle: The architecture aspect of self driving car.” In Proceedings of the 2018 International Conference on Sensors, Signal and Image Processing, pp. 1-5. 2018.
  • [3] Deusch, Hendrik, Jürgen Wiest, Stephan Reuter, Magdalena Szczot, Marcus Konrad, and Klaus Dietmayer. ”A random finite set approach to multiple lane detection.” In 2012 15th International IEEE Conference on Intelligent Transportation Systems, pp. 270-275. IEEE, 2012.
  • [4] Jung, Heechul, Junggon Min, and Junmo Kim. ”An efficient lane detection algorithm for lane departure detection.” In 2013 IEEE Intelligent Vehicles Symposium (IV), pp. 976-981. IEEE, 2013.
  • [5] Kim, Jihun, and Minho Lee. ”Robust lane detection based on convolutional neural network and random sample consensus.” In International conference on neural information processing, pp. 454-461. Springer, Cham, 2014.
  • [6] Li, Jun, Xue Mei, Danil Prokhorov, and Dacheng Tao. ”Deep neural network for structural prediction and lane detection in traffic scene.” IEEE transactions on neural networks and learning systems 28, no. 3 (2016): 690-703.
  • [7] Gallego, Guillermo, Tobi Delbruck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger et al. ”Event-based vision: A survey.” arXiv preprint arXiv:1904.08405 (2019).
  • [8] Mueggler, Elias, Henri Rebecq, Guillermo Gallego, Tobi Delbruck, and Davide Scaramuzza. ”The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM.” The International Journal of Robotics Research 36, no. 2 (2017): 142-149.

  • [9] Cheng, Wensheng, Hao Luo, Wen Yang, Lei Yu, Shoushun Chen, and Wei Li. ”DET: A high-resolution DVS dataset for lane extraction.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0-0. 2019.

  • [10] Chen, Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. ”Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.” IEEE transactions on pattern analysis and machine intelligence 40, no. 4 (2017): 834-848.
  • [11] Chen, Liang-Chieh, George Papandreou, Florian Schroff, and Hartwig Adam. ”Rethinking atrous convolution for semantic image segmentation.” arXiv preprint arXiv:1706.05587 (2017).
  • [12] Hillel, Aharon Bar, Ronen Lerner, Dan Levi, and Guy Raz. ”Recent progress in road and lane detection: a survey.” Machine vision and applications 25, no. 3 (2014): 727-745.
  • [13] Aly, Mohamed. ”Real time detection of lane markers in urban streets.” In 2008 IEEE Intelligent Vehicles Symposium, pp. 7-12. IEEE, 2008.
  • [14] Wang, Yue, Eam Khwang Teoh, and Dinggang Shen. ”Lane detection and tracking using B-Snake.” Image and Vision computing 22, no. 4 (2004): 269-280.
  • [15] Huval, Brody, Tao Wang, Sameep Tandon, Jeff Kiske, Will Song, Joel Pazhayampallil, Mykhaylo Andriluka et al. ”An empirical evaluation of deep learning on highway driving.” arXiv preprint arXiv:1504.01716 (2015).
  • [16] Wu, Pei-Chen, Chin-Yu Chang, and Chang Hong Lin. ”Lane-mark extraction for automobiles under complex conditions.” Pattern Recognition 47, no. 8 (2014): 2756-2767.
  • [17] Deng, Jiayong, and Youngjoon Han. ”A real-time system of lane detection and tracking based on optimized RANSAC B-spline fitting.” In Proceedings of the 2013 Research in Adaptive and Convergent Systems, pp. 157-164. 2013.
  • [18] Cáceres Hernández, Danilo, Laksono Kurnianggoro, Alexander Filonenko, and Kang Hyun Jo. ”Real-time lane region detection using a combination of geometrical and image features.” Sensors 16, no. 11 (2016): 1935.
  • [19] Son, Jongin, Hunjae Yoo, Sanghoon Kim, and Kwanghoon Sohn. ”Real-time illumination invariant lane detection for lane departure warning system.” Expert Systems with Applications 42, no. 4 (2015): 1816-1824.
  • [20] Lee, Chanho, and Ji-Hyun Moon. ”Robust lane detection and tracking for real-time applications.” IEEE Transactions on Intelligent Transportation Systems 19, no. 12 (2018): 4043-4048.
  • [21] Xu, Shikun, Ping Ye, Shengsheng Han, Hanxu Sun, and Qingxuan Jia. ”Road lane modeling based on RANSAC algorithm and hyperbolic model.” In 2016 3rd international conference on systems and informatics (ICSAI), pp. 97-101. IEEE, 2016.
  • [22] Du, Xinxin, and Kok Kiong Tan. ”Comprehensive and practical vision system for self-driving vehicle lane-level localization.” IEEE transactions on image processing 25, no. 5 (2016): 2075-2088.
  • [23] Jung, Soonhong, Junsic Youn, and Sanghoon Sull. ”Efficient lane detection based on spatiotemporal images.” IEEE Transactions on Intelligent Transportation Systems 17, no. 1 (2015): 289-295.
  • [24] Shin, Bok-Suk, Junli Tao, and Reinhard Klette. ”A superparticle filter for lane detection.” Pattern Recognition 48, no. 11 (2015): 3333-3345.
  • [25] Sun, Tsung-Ying, Shang-Jeng Tsai, and Vincent Chan. ”HSI color model based lane-marking detection.” In 2006 IEEE Intelligent Transportation Systems Conference, pp. 1168-1172. IEEE, 2006.
  • [26] Son, Jongin, Hunjae Yoo, Sanghoon Kim, and Kwanghoon Sohn. ”Real-time illumination invariant lane detection for lane departure warning system.” Expert Systems with Applications 42, no. 4 (2015): 1816-1824.
  • [27] Ma, Chao, and Mei Xie. ”A method for lane detection based on color clustering.” In 2010 Third International Conference on Knowledge Discovery and Data Mining, pp. 200-203. IEEE, 2010.
  • [28] De-hai, S. H. E. N., Z. H. A. N. G. Long-chang, and E. Xu. ”An improved edge detection algorithm based on Sobel operator.” Information Technology 4 (2015): 6.
  • [29] Wang, Yifei, Naim Dahnoun, and Alin Achim. ”A novel system for robust lane detection and tracking.” Signal Processing 92, no. 2 (2012): 319-334.
  • [30] Yoo, Hunjae, Ukil Yang, and Kwanghoon Sohn. ”Gradient-enhancing conversion for illumination-robust lane detection.” IEEE Transactions on Intelligent Transportation Systems 14, no. 3 (2013): 1083-1094.
  • [31] Niu, Jianwei, Jie Lu, Mingliang Xu, Pei Lv, and Xiaoke Zhao. ”Robust lane detection using two-stage feature extraction with curve fitting.” Pattern Recognition 59 (2016): 225-233.
  • [32] Nan, Zhixiong, Ping Wei, Linhai Xu, and Nanning Zheng. ”Efficient lane boundary detection with spatial-temporal knowledge filtering.” Sensors 16, no. 8 (2016): 1276.
  • [33] An, Xiangjing, Erke Shang, Jinze Song, Jian Li, and Hangen He. ”Real-time lane departure warning system based on a single FPGA.” EURASIP Journal on Image and Video Processing 2013, no. 1 (2013): 38.
  • [34] Liang, Minjian, Zhou Zhou, and Qingsong Song. ”Improved lane departure response distortion warning method based on Hough transformation and Kalman filter.” Informatica 41, no. 3 (2017).
  • [35] S. Kwon, D. Ding, J. Yoo, J. Jung, and S. Jin, “Multi-lane dection and tracking using dual parabolic model,” Bull. Netw., Comput., Syst., Softw., vol. 4, no. 1, pp. 65–68, 2015.
  • [36] Kim, ZuWhan. ”Robust lane detection and tracking in challenging scenarios.” IEEE Transactions on Intelligent Transportation Systems 9, no. 1 (2008): 16-26.
  • [37] Krähenbühl, Philipp, and Vladlen Koltun. ”Efficient inference in fully connected crfs with gaussian edge potentials.” In Advances in neural information processing systems, pp. 109-117. 2011.
  • [38] Yuan, Chang, Hui Chen, Ju Liu, Di Zhu, and Yanyan Xu. ”Robust lane detection for complicated road environment based on normal map.” IEEE Access 6 (2018): 49679-49689.
  • [39] Lee, Seokju, Junsik Kim, Jae Shin Yoon, Seunghak Shin, Oleksandr Bailo, Namil Kim, Tae-Hee Lee, Hyun Seok Hong, Seung-Hoon Han, and In So Kweon. ”Vpgnet: Vanishing point guided network for lane and road marking detection and recognition.” In Proceedings of the IEEE international conference on computer vision, pp. 1947-1955. 2017.
  • [40] Neven, Davy, Bert De Brabandere, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. ”Towards end-to-end lane detection: an instance segmentation approach.” In 2018 IEEE intelligent vehicles symposium (IV), pp. 286-291. IEEE, 2018.
  • [41] Tabelini, Lucas, Rodrigo Berriel, Thiago M. Paixão, Claudine Badue, Alberto F. De Souza, and Thiago Oliveira-Santos. ”PolyLaneNet: Lane Estimation via Deep Polynomial Regression.” arXiv preprint arXiv:2004.10924 (2020).
  • [42] Qin, Zequn, Huanyu Wang, and Xi Li. ”Ultra Fast Structure-aware Deep Lane Detection.” arXiv preprint arXiv:2004.11757 (2020).
  • [43] Hou, Yuenan, Zheng Ma, Chunxiao Liu, and Chen Change Loy. ”Learning lightweight lane detection cnns by self attention distillation.” In Proceedings of the IEEE International Conference on Computer Vision, pp. 1013-1021. 2019.
  • [44] Pizzati, Fabio, Marco Allodi, Alejandro Barrera, and Fernando García. ”Lane detection and classification using cascaded CNNs.” In International Conference on Computer Aided Systems Theory, pp. 95-103. Springer, Cham, 2019.
  • [45] The TuSimple lane challenge. http://benchmark.tusimple.ai
  • [46] Pan, Xingang, Jianping Shi, Ping Luo, Xiaogang Wang, and Xiaoou Tang. ”Spatial as deep: Spatial cnn for traffic scene understanding.” arXiv preprint arXiv:1712.06080 (2017).
  • [47] Li, Hongmin, Hanchao Liu, Xiangyang Ji, Guoqi Li, and Luping Shi. ”Cifar10-dvs: an event-stream dataset for object classification.” Frontiers in neuroscience 11 (2017): 309.
  • [48] Hu, Yuhuang, Hongjie Liu, Michael Pfeiffer, and Tobi Delbruck. ”DVS benchmark datasets for object tracking, action recognition, and object recognition.” Frontiers in neuroscience 10 (2016): 405.
  • [49] Binas, Jonathan, Daniel Neil, Shih-Chii Liu, and Tobi Delbruck. ”DDD17: End-to-end DAVIS driving dataset.” arXiv preprint arXiv:1711.01458 (2017).
  • [50] Maqueda, Ana I., Antonio Loquercio, Guillermo Gallego, Narciso García, and Davide Scaramuzza. ”Event-based vision meets deep learning on steering prediction for self-driving cars.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5419-5427. 2018.
  • [51] Chen, Nicholas FY. ”Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 644-653. 2018.

  • [52] Chen, Liang-Chieh, George Papandreou, Florian Schroff, and Hartwig Adam. ”Rethinking atrous convolution for semantic image segmentation.” arXiv preprint arXiv:1706.05587 (2017).
  • [53] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. ”Fully convolutional networks for semantic segmentation.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431-3440. 2015.
  • [54] Lin, Guosheng, Anton Milan, Chunhua Shen, and Ian Reid. ”Refinenet: Multi-path refinement networks for high-resolution semantic segmentation.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1925-1934. 2017.
  • [55] Simonyan, Karen, and Andrew Zisserman. ”Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
  • [56] Ghiasi, Golnaz, Tsung-Yi Lin, and Quoc V. Le. ”Dropblock: A regularization method for convolutional networks.” In Advances in Neural Information Processing Systems, pp. 10727-10737. 2018.
  • [57] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. ”Fully convolutional networks for semantic segmentation.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431-3440. 2015.
  • [58] Papandreou, George, Iasonas Kokkinos, and Pierre-André Savalle. ”Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 390-399. 2015.
  • [59] Shen, Tao, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. ”Disan: Directional self-attention network for rnn/cnn-free language understanding.” arXiv preprint arXiv:1709.04696 (2017).
  • [60] Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. ”Non-local neural networks.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794-7803. 2018.
  • [61] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. ”Deep residual learning for image recognition.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778. 2016.
  • [62] Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. ”Mobilenetv2: Inverted residuals and linear bottlenecks.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510-4520. 2018.
  • [63] Ma, Ningning, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ”Shufflenet v2: Practical guidelines for efficient cnn architecture design.” In Proceedings of the European conference on computer vision (ECCV), pp. 116-131. 2018.
  • [64] Huang, Gao, Zhuang Liu, and Kilian Q. Weinberger. ”Densely connected convolutional networks. CoRR abs/1608.06993 (2016).” arXiv preprint arXiv:1608.06993 (2016).