Learning Lightweight Lane Detection CNNs by Self Attention Distillation (ICCV 2019)
Training deep models for lane detection is challenging due to the very subtle and sparse supervisory signals inherent in lane annotations. Without learning from much richer context, these models often fail in challenging scenarios, e.g., severe occlusion, ambiguous lanes, and poor lighting conditions. In this paper, we present a novel knowledge distillation approach, i.e., Self Attention Distillation (SAD), which allows a model to learn from itself and gain substantial improvement without any additional supervision or labels. Specifically, we observe that attention maps extracted from a model trained to a reasonable level would encode rich contextual information. The valuable contextual information can be used as a form of 'free' supervision for further representation learning through performing top-down and layer-wise attention distillation within the network itself. SAD can be easily incorporated in any feedforward convolutional neural network (CNN) and does not increase the inference time. We validate SAD on three popular lane detection benchmarks (TuSimple, CULane and BDD100K) using lightweight models such as ENet, ResNet-18 and ResNet-34. The lightest model, ENet-SAD, performs comparably to or even surpasses existing algorithms. Notably, ENet-SAD has 20× fewer parameters and runs 10× faster compared to the state-of-the-art SCNN, while still achieving compelling performance in all benchmarks. Our code is available at https://github.com/cardwing/Codes-for-Lane-Detection.
Lane detection plays a pivotal role in autonomous driving as lanes could serve as significant cues for constraining the maneuver of vehicles on roads. Detecting lanes in the wild is challenging due to poor lighting conditions, occlusions caused by other vehicles, irrelevant road markings, and the inherent long and thin property of lanes.
Contemporary algorithms [5, 8, 14, 16] typically adopt a dense prediction formulation, i.e., treat lane detection as a semantic segmentation task, where each pixel in an image is assigned a binary label to indicate whether it belongs to a lane or not. These methods heavily rely on the segmentation maps of lanes as the supervisory signals. Since lanes are long and thin, the number of annotated lane pixels is far smaller than the number of background pixels. Learning from such subtle and sparse annotations becomes a major challenge in training deep models for the task. A plausible remedy is to increase the width of lane annotations. However, doing so may degrade the detection performance.
Several schemes have been proposed to relieve the reliance of deep models on the sparse annotations, e.g., multi-task learning (MTL) and message passing (MP). For example, Lee et al. exploit vanishing points to guide the training of deep models and Pan et al. incorporate spatial MP in their lane detection models. MTL can indeed provide additional supervisory signals but it requires additional effort, usually with human intervention, to prepare the annotations, e.g., scene segmentation maps, vanishing points, or drivable areas. MP can help propagate the information between neurons to counter the effect of sparse supervision and better capture the scene context. However, it increases the inference time significantly due to the overhead of MP. For instance, applying MP in a layer of SCNN contributes 35% of its total feed-forward time.
In this work, we present a simple yet novel approach that allows a lane detection network to reinforce its own representation learning without the need for additional labels or external supervision. In addition, it does not increase the inference time of the base model. Our approach is named Self-Attention Distillation (SAD). As the name implies, SAD allows a network to exploit attention maps derived from its own layers as the distillation targets for its lower layers. Such an attention distillation mechanism is used to complement the usual segmentation-based supervised learning.
SAD is motivated by an interesting observation: when a lane detection network is trained to a reasonable level, attention maps derived from different layers capture diverse and rich contextual information that hints at the lane locations and a rough outline of the scene, as shown in Fig. 1 (before SAD at 40K episodes). By adding SAD to the learning of this half-trained model, i.e., having the preceding block mimic the attention maps of a deeper block, e.g., block 3 → block 4 and block 2 → block 3, the network can learn to strengthen its representations, as shown in Fig. 1 (after SAD): (1) the attention maps of lower layers are refined, with richer scene contexts captured by the visual attention, and (2) the better representation learned at lower layers in turn benefits the deeper layers. For instance, although block 4 does not learn from any distillation targets, its representation is reinforced, as evident from the much more distinct attention at the lane locations. By contrast, without using SAD, the visual attentions of different layers of the same network hardly improve despite continual training up to 60K episodes.
SAD opens a new possibility of training accurate lane detection networks apart from deploying existing techniques such as multi-task learning and message passing, which can be expensive. It allows us to train small networks with excellent visual attention that is on par with very deep networks. In our experiments, we successfully demonstrate the effectiveness of SAD on a few popular lightweight models, e.g., ENet, ResNet-18 and ResNet-34.
In summary, our contributions are three-fold: (1) We propose a novel attention distillation approach, i.e., SAD, to enhance the representation learning of CNN-based lane detection models. SAD is only used in the training phase and brings no computational cost during deployment. Our work is the first attempt at using a network’s own attention maps as the distillation targets. (2) We carefully and systematically investigate the inner mechanism of SAD, the considerations in choosing among different layer-wise mimicking paths, and the timepoint of introducing SAD to the training process for improved gains. (3) We verify the usefulness of SAD in boosting the performance of small lane detection networks. We further present several architectural reformulations to ENet for improved performance. Our lightweight model, ENet-SAD, achieves state-of-the-art lane detection performance on TuSimple, CULane and BDD100K. It can serve as a strong backbone to facilitate future research on lane detection.
Lane detection methods built on hand-crafted features have many shortcomings, e.g., requiring a complex feature selection process, lacking robustness, and being applicable only to relatively easy driving scenarios.
Recently, deep learning has been employed to omit hand-crafted features altogether and learn to extract features in an end-to-end manner [14, 16, 8, 5]. These approaches usually adopt the dense prediction formulation, i.e., treat lane detection as a semantic segmentation task, where each pixel in an image is assigned a label to indicate whether it belongs to a lane or not. For example, He et al. propose Dual-View CNN (DVCNN) to handle lane detection. The method takes front-view and top-view images as inputs. Another popular paradigm performs lane detection from the perspective of instance segmentation. For instance, Neven et al. divide lane detection into two stages. Specifically, they first perform binary segmentation that differentiates lane pixels and background pixels. These lane pixels are then classified into different lane instances.
Several schemes have been proposed to complement the lane-based supervision and to capture richer scene context, e.g., multi-task learning and message passing. For example, Zhang et al. establish a framework that accomplishes lane boundary segmentation and road area segmentation simultaneously. Geometric constraints that lane boundaries and lane areas constitute the road are also included to further enhance the final performance. Mohsen et al. take lane labels as extra inputs and integrate a generative adversarial network (GAN) into the original framework so that the segmentation maps resemble the labels more closely. Pan et al. perform sequential message passing between the outputs of top-level layers to better exploit the structural information. While the aforementioned methods do bring additional gains to the performance, multi-task learning requires extra annotations and message passing is not efficient since it propagates information in a sequential way. In contrast, the proposed SAD is free from the requirement of extra annotations and it does not increase the inference time.
Knowledge and attention distillation. Knowledge distillation was originally proposed to transfer the knowledge from large networks to small networks. Commonly in knowledge distillation, a small student network mimics the intermediate outputs of large teacher networks as well as the labels. In [7, 21] the student and teacher networks share the same capacity and mimicking is performed between pairs of layers with the same dimensionality. Hou et al. also investigate knowledge distillation performed between heterogeneous networks. Recent studies [24, 19] have expanded knowledge distillation to attention distillation. For instance, Sergey et al. introduce two types of attention distillation, i.e., activation-based attention distillation and gradient-based attention distillation. In both kinds of distillation, a student network is trained through learning attention maps derived from a teacher network. The proposed SAD differs from these approaches in that our method does not need a teacher network. Distillation is conducted in a layer-wise and top-down manner, in which attention knowledge is propagated layer by layer. This is new in the literature. It is noteworthy that our focus is to investigate the possibility of distilling layer-wise attention for self-learning. This differs from existing studies on using visual attention for weighting features [4, 13, 24].
Lane detection is commonly formulated as a semantic segmentation task. More specifically, given an input image $X$, the objective is to assign a label $l$ ($l = 1, \dots, K$) to each pixel of $X$, comprising the segmentation map $s$. Here, $K$ is the number of classes. The objective is to learn a mapping $\mathcal{F}: X \mapsto s$. Recent studies use a CNN as $\mathcal{F}$ for end-to-end prediction. The task of lane existence prediction is also introduced to facilitate the evaluation process. We use $b$ to represent the binary labels that indicate the existence of lanes. Then, the function becomes $\mathcal{F}: X \mapsto (s, b)$.
Apart from training our lane detection network with the aforementioned semantic segmentation and lane existence prediction losses, we aim to perform layer-wise and top-down attention distillation to enhance the representation learning process. The proposed SAD does not require any external supervision or additional labels since attention maps are derived from the network itself.
In general, attention maps can be divided into two categories, i.e., activation-based attention maps and gradient-based attention maps. The activation-based attention maps are obtained by processing the activation output of a specific layer while the gradient-based ones are obtained from the layer’s gradient output. In the experiment, we empirically find that activation-based attention distillation yields considerable performance gains while gradient-based attention distillation barely works. Hence, in the following sections we only discuss the activation-based attention distillation.
Activation-based attention distillation. We use $A_m \in \mathbb{R}^{C_m \times H_m \times W_m}$ to denote the activation output of the $m$-th layer of the network, where $C_m$, $H_m$ and $W_m$ denote the channel, height and width, respectively. Let $M$ denote the number of layers in the network. The generation of the attention map is equivalent to finding a mapping function $\mathcal{G}: \mathbb{R}^{C_m \times H_m \times W_m} \rightarrow \mathbb{R}^{H_m \times W_m}$. The absolute value of each element in this map represents the importance of this element on the final output. Therefore, this mapping function can be constructed via computing statistics of these values across the channel dimension. More specifically, the following three operations can serve as the mapping function: $\mathcal{G}_{sum}(A_m) = \sum_{i=1}^{C_m} |A_{mi}|$, $\mathcal{G}^{p}_{sum}(A_m) = \sum_{i=1}^{C_m} |A_{mi}|^{p}$ and $\mathcal{G}^{p}_{max}(A_m) = \max_{i} |A_{mi}|^{p}$. Here, $p > 1$ and $A_{mi}$ denotes the $i$-th slice of $A_m$ in the channel dimension.
The differences between these mapping functions are depicted in Fig. 2. Compared with $\mathcal{G}_{sum}(A_m)$, $\mathcal{G}^{p}_{sum}(A_m)$ puts more weight on areas with higher activations. The larger $p$ is, the more focus is placed on these highly activated areas. Compared with $\mathcal{G}^{p}_{max}(A_m)$, $\mathcal{G}^{p}_{sum}(A_m)$ is less biased since it calculates weights across multiple neurons instead of selecting the maximum value of these neuron activations as the weight. In the experiment, we empirically find that using $\mathcal{G}^{2}_{sum}(A_m)$ as the mapping function yields the most performance gains.
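The channel-wise statistics described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `attention_sum` and `attention_max` are hypothetical, and the toy activation tensor is made up for the example.

```python
import numpy as np

def attention_sum(act, p=1):
    """Sum of |activation|**p over the channel axis; act has shape (C, H, W)."""
    return (np.abs(act) ** p).sum(axis=0)

def attention_max(act, p=1):
    """Maximum of |activation|**p over the channel axis; act has shape (C, H, W)."""
    return (np.abs(act) ** p).max(axis=0)

# Toy activation (hypothetical values): 2 channels on a 1 x 2 spatial grid.
act = np.array([[[1.0, -3.0]],
                [[2.0, 0.5]]])

a_sum = attention_sum(act)        # values 3.0 and 3.5
a_sum2 = attention_sum(act, p=2)  # values 5.0 and 9.25
a_max2 = attention_max(act, p=2)  # values 4.0 and 9.0
print(a_sum, a_sum2, a_max2)
```

With p = 1 the two positions receive nearly equal weight (3.0 vs. 3.5), whereas p = 2 favours the position holding the single strong activation (5.0 vs. 9.25), which is the behaviour the text attributes to larger exponents.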
Adding SAD to training. The intuition behind SAD is that the attention maps of previous layers can distil useful contextual information from those of successive layers. Following prior work on attention distillation, we also perform a spatial softmax operation $\Phi(\cdot)$ on $\mathcal{G}^{2}_{sum}(A_m)$. Bilinear upsampling $\mathcal{B}(\cdot)$ is added before the softmax operation if the size of the original attention maps is different from that of the targets. However, different from Sergey et al., who perform attention distillation between two networks, the proposed self attention distillation is performed within the network itself.
Adding SAD to an existing network is straightforward. It is possible to introduce SAD at different timepoints of the training, which could affect the convergence time. We will show an evaluation in the experiment section. Here we assume an ENet half-trained to 40K episodes. As shown in Fig. 3, we add an attention generator, abbreviated as AT-GEN, after encoder blocks 2, 3 and 4 of ENet. Formally, AT-GEN is represented by a function $\Psi(\cdot) = \Phi(\mathcal{B}(\mathcal{G}^{2}_{sum}(\cdot)))$. A successive layer-wise distillation loss is formulated as follows:
$$\mathcal{L}_{distill}(A_m, A_{m+1}) = \sum_{m=1}^{M-1} \mathcal{L}_d\big(\Psi(A_m), \Psi(A_{m+1})\big),$$
where $\mathcal{L}_d$ is typically defined as an $L_2$ loss and $\Psi(A_{m+1})$ is the target of the distillation loss. In the example shown in Fig. 3, we have the number of layers $M = 4$. Note that we do not assign different weights to different SAD paths, although this is possible. We found that this uniform scheme works well in our experiments.
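To make the attention generator and the successive layer-wise loss concrete, here is a forward-only NumPy sketch. It rests on several assumptions that the text does not fix: the squared-sum attention map is used, the L2 loss is taken as a mean squared difference, the shallower map is resized to the target's spatial size with corner-aligned bilinear sampling, and all function names are hypothetical. NumPy has no autograd, so whether the target is excluded from back-propagation is outside the scope of this sketch.

```python
import numpy as np

def bilinear_resize(a, out_h, out_w):
    """Separable bilinear resize of a 2-D map (corner-aligned sampling, an assumption)."""
    h, w = a.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    # Interpolate along the width of every row, then along the height of every column.
    rows = np.stack([np.interp(xs, np.arange(w), r) for r in a])                # (h, out_w)
    return np.stack([np.interp(ys, np.arange(h), c) for c in rows.T], axis=1)   # (out_h, out_w)

def spatial_softmax(a):
    """Softmax over all spatial positions of a 2-D map."""
    e = np.exp(a - a.max())
    return e / e.sum()

def at_gen(act, target_hw):
    """Attention generator: squared-sum attention, optional resize, spatial softmax."""
    att = (np.abs(act) ** 2).sum(axis=0)
    if att.shape != tuple(target_hw):
        att = bilinear_resize(att, *target_hw)
    return spatial_softmax(att)

def sad_loss(acts):
    """Sum of mean-squared differences between successive attention maps.

    acts: list of (C, H, W) activations ordered from shallow to deep.
    """
    loss = 0.0
    for shallow, deep in zip(acts[:-1], acts[1:]):
        hw = deep.shape[1:]
        target = at_gen(deep, hw)  # attention of the deeper block is the target
        loss += np.mean((at_gen(shallow, hw) - target) ** 2)
    return loss

rng = np.random.default_rng(0)
acts = [rng.normal(size=(4, 8, 8)),
        rng.normal(size=(8, 8, 8)),
        rng.normal(size=(8, 4, 4))]
print(sad_loss(acts))
```

Every path contributes with the same weight, mirroring the uniform scheme mentioned in the text; per-path weights would simply multiply each term inside the loop.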
The total loss is comprised of four terms:
$$\mathcal{L} = \mathcal{L}_{seg}(s, \hat{s}) + \alpha \mathcal{L}_{IoU}(s, \hat{s}) + \beta \mathcal{L}_{exist}(b, \hat{b}) + \gamma \mathcal{L}_{distill}(A_m, A_{m+1}).$$
Here, the first two terms are segmentation losses that comprise the standard cross entropy loss $\mathcal{L}_{seg}$ and the IoU loss $\mathcal{L}_{IoU}$. The IoU loss aims at increasing the intersection-over-union between the predicted lane pixels and ground-truth lane pixels. It is formulated as $\mathcal{L}_{IoU} = 1 - \frac{N_i}{N_p + N_g - N_i}$, where $N_p$ is the number of predicted lane pixels, $N_g$ is the number of ground-truth lane pixels and $N_i$ is the number of lane pixels in the overlapped areas between predicted lane areas and ground-truth lane areas. $\mathcal{L}_{exist}$ is the binary cross entropy loss. $\hat{s}$ is the segmentation map produced by the network and $\hat{b}$ is the prediction of the existence of lanes. The parameters $\alpha$, $\beta$, and $\gamma$ balance the influence of segmentation losses, existence loss, and distillation loss on the final task.
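The way the four terms combine can be sketched as follows. The sketch assumes the IoU loss takes the common form one minus overlap over union, computed with soft counts when the prediction is a probability map; the function names are hypothetical, and the default coefficients of 0.1 follow the values given in the implementation details.

```python
import numpy as np

def iou_loss(pred, gt):
    """1 - overlap / union, with (soft) pixel counts; pred and gt share a shape."""
    n_i = (pred * gt).sum()   # lane pixels in the overlap
    n_p = pred.sum()          # predicted lane pixels
    n_g = gt.sum()            # ground-truth lane pixels
    return 1.0 - n_i / (n_p + n_g - n_i)

def total_loss(l_seg, l_iou, l_exist, l_distill, alpha=0.1, beta=0.1, gamma=0.1):
    """Cross entropy plus weighted IoU, existence and distillation terms."""
    return l_seg + alpha * l_iou + beta * l_exist + gamma * l_distill

pred = np.array([[1.0, 1.0, 0.0, 0.0]])
gt = np.array([[0.0, 1.0, 1.0, 0.0]])
l_iou = iou_loss(pred, gt)   # overlap 1, union 3, so about 0.667
print(l_iou, total_loss(0.5, l_iou, 0.2, 0.1))
```

The scalar inputs to `total_loss` in the usage line are made-up numbers chosen only to show the weighting.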
It is noteworthy that the SAD paths can be generalized to dense connections beyond the example shown here. For instance, we can add block 1 → block 3, block 1 → block 4, and block 2 → block 4 in addition to the current paths. In general, the number of possible SAD paths for a network with a depth of $M$ layers is $M(M-1)/2$. We will evaluate this possibility in our experiments.
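Since a SAD path always runs from a shallower block to a deeper one, the dense set of paths is the set of ordered pairs (i, j) with i < j, of which there are M(M-1)/2. A small enumeration, with the hypothetical helper name `sad_paths`, illustrates this count for four blocks:

```python
from itertools import combinations

def sad_paths(num_blocks):
    """All (shallow, deep) block pairs, using 1-based block indices."""
    return list(combinations(range(1, num_blocks + 1), 2))

paths = sad_paths(4)
print(len(paths))  # 6, i.e. 4 * 3 / 2
print(paths)       # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```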
Visualization of attention maps with and without SAD. We investigate the influence of SAD by studying the attention maps of different blocks in ENet with and without SAD. More results will be reported in Section 4. Both networks, with and without SAD, are trained up to 60K episodes. We visualize the attention maps of the four existing blocks in ENet. As can be observed in Fig. 4, after adding SAD, the attention maps of ENet become more concentrated on task-relevant objects, e.g., lanes, vehicles and road curbs. This would in turn improve the lane detection accuracy, as we will show in the experiments.
The output of the model is post-processed only for CULane, not for TuSimple or BDD100K. For CULane, in the inference stage, we feed the image into the ENet model. Then the multi-channel probability maps and the lane existence vector are obtained. Following prior work, the final output is obtained as follows: First, we use a 9 × 9 kernel to smooth the probability maps. Then, for each lane whose existence probability is larger than 0.5, we search the corresponding probability map every 20 rows for the position with the highest probability value. In the end, we use cubic splines to connect these positions to get the final output.
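The first two post-processing steps can be sketched in NumPy as below. The text does not say which kind of 9 × 9 kernel is used, so a box (mean) filter with edge padding is assumed here; the function names are hypothetical and the synthetic probability map is made up. The final cubic-spline step is left out of the sketch; a spline routine such as `scipy.interpolate.CubicSpline` could be fitted through the returned points.

```python
import numpy as np

def box_smooth(prob, k=9):
    """Mean filter with a k x k window and edge padding (kernel type is an assumption)."""
    pad = k // 2
    padded = np.pad(prob, pad, mode="edge")
    h, w = prob.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def lane_points(prob, exist_prob, row_step=20, exist_thresh=0.5):
    """Return (row, column) peaks sampled every row_step rows for one lane."""
    if exist_prob <= exist_thresh:
        return []  # lane judged absent, nothing to extract
    smoothed = box_smooth(prob)
    return [(r, int(np.argmax(smoothed[r]))) for r in range(0, prob.shape[0], row_step)]

# Synthetic map: a vertical lane centred on column 20 with a triangular profile.
profile = np.maximum(0.0, 1.0 - np.abs(np.arange(50) - 20) / 10.0)
prob = np.tile(profile, (60, 1))
points = lane_points(prob, exist_prob=0.9)
print(points)  # [(0, 20), (20, 20), (40, 20)]
```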
The original ENet model is an encoder-decoder structure comprised of several encoder and decoder blocks (see Fig. 3). Following prior work, we add a small network to predict the existence of lanes. The encoder module is shared to save memory space. Apart from this modification, we also identified some useful techniques to modify ENet for better performance in the lane detection task. Dilated convolution is added to replace the original convolution layers in the lane existence prediction branch to increase the receptive field of the network without increasing the number of parameters. In the original design, the resolution of the final encoder feature maps is only 36 × 100 for CULane. This leads to severe loss of information. Hence, we use feature concatenation to fuse the output of the final encoder block with that of a preceding block so that the output of the encoder can benefit from information encoded in previous layers.
|Name||# Frames||Train||Validation||Test||Resolution||Road Type||# Lanes ≤ 5?|
|TuSimple||6,408||3,268||358||2,782||1280 × 720||highway||Yes|
|CULane||133,235||88,880||9,675||34,680||1640 × 590||urban, rural and highway||Yes|
|BDD100K||80,000||60,000||10,000||10,000||1280 × 720||urban, rural and highway||No|
Datasets. Figure 5 shows several video frames of the three datasets that we use in our experiments. They are TuSimple, CULane and BDD100K. TuSimple and CULane are widely used in the literature. Many algorithms [16, 15, 8] have been tested on TuSimple since it was the largest lane detection dataset before 2018. As to CULane, it contains many challenging driving scenarios like crowded road conditions or roads under poor lighting (see Fig. 5). BDD100K was originally designed for lane instance classification. However, since there are typically multiple lanes in an image and these lanes are usually very close to each other, using instance segmentation algorithms will yield inferior performance. Therefore, we choose to only detect lanes without differentiating lane instances for BDD100K. We discuss the details of transforming the original ground truths for our task in the following section on implementation details. Table 1 summarizes the details of the three datasets. Note that the last column of Table 1 shows that TuSimple and CULane have no more than 5 lanes in a video frame while BDD100K typically contains more than 8 lanes in a video frame. Besides, TuSimple is relatively easy while CULane and BDD100K are more challenging considering the total number of video frames and road types. Note that the original BDD100K dataset provides 100K video frames, of which 70K are used for training, 10K for validation and 20K for testing. However, since the ground-truth labels of the testing partition are not publicly available, we keep the training set unchanged but use the original validation set for testing. A new validation set is allocated separately from the training set, as shown in Table 1.
Evaluation metrics. To facilitate comparisons against previous studies, we follow the literature and use the corresponding evaluation metrics for each particular dataset.
1) TuSimple. We use the official metric (accuracy) as the evaluation criterion. Besides, false positive ($FP$) and false negative ($FN$) rates are also reported. Accuracy is computed as $Accuracy = \frac{N_{pred}}{N_{gt}}$, where $N_{pred}$ is the number of correctly predicted lane points and $N_{gt}$ is the number of ground-truth lane points.
2) CULane. Following prior work, to judge whether a lane is correctly detected, we treat each lane as a line with a width of 30 pixels and compute the intersection-over-union (IoU) between labels and predictions. Predictions whose IoUs are larger than 0.5 are considered true positives (TP). Then, we use the $F_1$-measure as the evaluation metric, which is defined as $F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$, where $Precision = \frac{TP}{TP + FP}$ and $Recall = \frac{TP}{TP + FN}$.
3) BDD100K. Since there are typically more than 8 lanes in an image, we decide to use pixel accuracy and IoU of lanes to evaluate the performance of different models.
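The point-level accuracy and the F1-measure used above reduce to simple ratios of counts. The following sketch uses hypothetical function names and made-up counts; it does not handle the degenerate case where a denominator is zero.

```python
def tusimple_accuracy(n_correct, n_gt):
    """Fraction of ground-truth lane points that are predicted correctly."""
    return n_correct / n_gt

def f1_measure(tp, fp, fn):
    """Harmonic mean of precision and recall computed from lane-level counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

acc = tusimple_accuracy(95, 100)        # 0.95
f1 = f1_measure(tp=70, fp=30, fn=30)    # precision = recall = 0.7, so F1 is about 0.7
print(acc, f1)
```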
Implementation details. Following prior work, we resize the images of TuSimple and CULane to 368 × 640 and 288 × 800, respectively. As to BDD100K, we resize the images to 360 × 640 to save memory usage. The lanes of BDD100K are labelled by two lines. Training the networks using the provided labels is tricky. Therefore, based on these two lines, we calculate the center lines as new targets. We dilate the ground-truth lanes of the BDD100K training set to 8 pixels to provide denser targets while keeping those of the testing set unchanged (2 pixels). We use SGD to train our models and the learning rate is set to 0.01. Batch size is set as 12 and the total number of training episodes is set as 1800 for TuSimple and 60K for CULane and BDD100K. The cross entropy loss of background pixels is multiplied by 0.4. Loss coefficients $\alpha$, $\beta$, and $\gamma$ are set as 0.1. Since we select lane pixel accuracy and IoU as the evaluation criteria for the BDD100K dataset, we alter the original segmentation branch to output binary segmentation maps to facilitate the evaluation on BDD100K. The lane existence prediction branch is also removed for the BDD100K evaluation.
We empirically found that several practical techniques, e.g., data augmentation and IoU loss, can considerably enhance the performance of CNN-based lane detection models. As to data augmentation, we use random rotation, random cropping and horizontal flipping to process the input images. In our experiments, we apply the same segmentation losses and augmentation strategy to our method, SCNN, ResNet baselines, and deep supervision methods, to ensure a fair comparison. Since the source codes of LaneNet and EL-GAN are not available, we use the results reported in their papers.
Tables 2-4 summarize the performance of our methods, i.e., ResNet-18-SAD, ResNet-34-SAD, and ENet-SAD, against state-of-the-art algorithms on the testing sets of the TuSimple, CULane and BDD100K datasets. We also report the runtime and parameter count of different algorithms in Table 3 so that we can compare the performance with the complexity of the model taken into account. The runtime is recorded using a single GPU (GeForce GTX TITAN X) and the final value is obtained by averaging the runtime over 100 samples.
It is observed that ENet-SAD outperforms all baselines in BDD100K while achieving compelling performance in TuSimple and CULane. Considering that ENet-SAD has 20× fewer parameters and runs 10× faster compared with SCNN on the CULane testing set, the performance strongly suggests the effectiveness of SAD. It is observed that ResNet-18-SAD and ResNet-34-SAD achieve slightly inferior performance to ENet-SAD despite their larger model capacity. This is because ResNet-18 and ResNet-34 only use spatial upsampling as the decoder while ENet has a specially designed decoder for the task. It is noteworthy that SAD also helps given a deeper model. Specifically, we apply SAD to ResNet-101, and find that it increases the F-measure from 70.8 to 71.8 in CULane and the accuracy from 34.45% to 35.56% in BDD100K.
We show some qualitative results of our algorithm and some baselines on these three benchmarks. As can be seen in Fig. 6, ENet-SAD can detect lanes more precisely than ENet in TuSimple and CULane. As can be seen in Fig. 7, the output probability maps of ENet-SAD are more compact and contain less noise compared with those of vanilla ENet and SCNN in poor lighting conditions. However, since many images in BDD100K contain more than 8 lanes and are collected in challenging scenarios like severe occlusion and poor lighting conditions, the performance of all algorithms is unsatisfactory and needs further improvement. In general, SAD can improve the visual attention as well as the detection performance in challenging conditions like crowded roads and poor lighting conditions.
We also perform experiments that apply SAD and remove the effect of the P1 branch by blocking the gradient of the P1 branch from the main branch. Results show that ENet-SAD (without supervision from P1 branch) can still achieve 96.61 on TuSimple, 70.8 on CULane and 36.54 on BDD100K, which means the performance gains come mainly from SAD itself.
We investigate the effects of different factors, e.g., the mimicking path, on the final performance. Besides, we also perform extensive experiments to investigate the timepoint to introduce SAD in the training process.
Distillation paths of SAD. We summarize the performance of performing SAD between different blocks of ENet in Table 5. We have a few observations. (1) SAD works well in the middle and high-level layers. (2) Adding SAD in low-level layers will degrade the performance. The reason why SAD does not work in low-level layers is that these layers are originally designated to detect low-level details of the scene. Making them mimic the attention maps of later layers will inevitably harm their ability to detect local features since later layers encode more global information. Besides, we also find that mimicking the attention maps of the neighbouring layer successively brings more performance gains compared with mimicking those of non-adjacent layers (see Table 5). We conjecture that attention maps of neighbouring layers are closer from the semantic perspective compared with those of non-neighbouring layers (see Fig. 1).
Backward distillation. We also tested another distillation scheme that makes higher layers mimic lower layers. It decreases the performance of ENet from 93.02% to 91.26% on the TuSimple dataset. This is not surprising as low-level attention maps contain more details and are noisier. Having higher-level layers mimic lower layers will inevitably interfere with the global information captured in higher layers, hampering the crucial clues for the lane detection task.
SAD vs. deep supervision. We also compare SAD with deep supervision. Here, deep supervision denotes the algorithm that uses the labels directly as supervision for each layer in the network. More specifically, we use 1 × 1 convolution and bilinear upsampling to obtain the prediction of intermediate layers and use the cross entropy loss to train the intermediate outputs of the model. We empirically find that adding deep supervision in blocks 2 to 4 obtains the most significant performance gains. As can be seen in Table 6, SAD brings more performance gains than deep supervision in all three benchmarks. We attribute this to the following reasons. Firstly, compared with labels that are considered sparse and rigid, SAD provides softer attention targets that capture more contextual information indicating the scene structure. Distilling information from attention maps of later layers helps previous layers to grasp the contextual signals. Secondly, a SAD path offers a feedback connection from deeper layers to shallower layers. The connection helps facilitate reciprocal learning between successive layers through attention distillation.
When to add SAD. Recall that we assume a half-trained model before we add SAD into the training. Here, we investigate the different timepoints to add SAD. As can be seen in Fig. 8, although different timepoints of introducing SAD achieve almost the same performance in the end, the time to add SAD has an effect on the convergence speed of the networks. We attribute the phenomenon to the quality of the target attention maps produced by later layers. In earlier training stages, deeper layers have not been trained well and therefore the distillation targets produced by these layers are of low quality. Introducing SAD at these earlier stages is not as fruitful. Conversely, adding SAD in later training stages would benefit the representation learning of the previous layers.
We have proposed a simple yet effective attention distillation approach, i.e., SAD, to improve the representation learning of CNN-based lane detection models. SAD is validated on various models (i.e., ENet, ResNet-18, ResNet-34, and ResNet-101) and achieves consistent performance gains on three popular benchmarks (i.e., TuSimple, CULane and BDD100K), demonstrating the effectiveness of SAD. The results show that SAD can generally improve the visual attention of different layers in various networks. It would be interesting to extend this idea to other tasks that demand fine-grained attention to details, such as image saliency detection and image matting.
Acknowledgement: This work is supported by SenseTime Group Limited, the General Research Fund sponsored by the Research Grants Council of the Hong Kong SAR (CUHK 14241716), Singapore MOE AcRF Tier 1 (M4012082.020), NTU SUG, and NTU NAP.
Spatial as deep: spatial CNN for traffic scene understanding. In Association for the Advancement of Artificial Intelligence.
A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141.
Table 7 summarizes the architecture of the lane existence prediction branch for ENet-SAD, ResNet-18-SAD and ResNet-34-SAD. As to ResNet-18-SAD and ResNet-34-SAD, we also use dilated convolution to replace the original convolution layers in the last two blocks of ResNet-18 and ResNet-34.
|Layer Name||Output Size|
|Dilated Convolution (3, 1, 4, 4)||32 × 36 × 100|
|Batch Normalization||32 × 36 × 100|
|ReLU||32 × 36 × 100|
|Spatial Dropout (0.1)||32 × 36 × 100|
|Convolution (1, 1)||5 × 36 × 100|
|Spatial SoftMax||5 × 36 × 100|
|Average Pooling||5 × 18 × 50|
The three numbers of each output size denote channel, height and width, respectively. The numbers in the brackets beside the layer name are the parameters for that layer. For instance, the four numbers beside dilated convolution denote kernel size, stride, padding and dilation rate, respectively.
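The spatial sizes in the table can be checked with the standard output-size formula for (dilated) convolution and pooling layers. In this sketch the helper name `conv_out` is hypothetical, and the average-pooling parameters (kernel 2, stride 2, no padding) are an assumption chosen because they reproduce the 36 × 100 to 18 × 50 reduction; the table itself does not state them.

```python
def conv_out(size, kernel, stride, padding=0, dilation=1):
    """Output length along one spatial axis for a convolution or pooling layer."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# Dilated convolution (3, 1, 4, 4): kernel 3, stride 1, padding 4, dilation 4.
h, w = conv_out(36, 3, 1, 4, 4), conv_out(100, 3, 1, 4, 4)
print(h, w)  # 36 100, the spatial size is preserved

# Average pooling, assuming kernel 2 and stride 2.
ph, pw = conv_out(h, 2, 2), conv_out(w, 2, 2)
print(ph, pw)  # 18 50
```

With kernel 3 and dilation 4 the effective kernel extent is 9, so a padding of 4 on each side keeps the 36 × 100 resolution unchanged.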
For CULane, in the inference stage, we feed the image into the ENet model. Then the multi-channel probability maps and the lane existence vector are obtained. Following prior work, the final output is obtained as follows: First, we use a 9 × 9 kernel to smooth the probability maps. Then, for each lane whose existence probability is larger than 0.5, we search the corresponding probability map every 20 rows for the position with the highest probability value. Finally, we use cubic splines to connect these positions to get the final output. The process improves the final lane prediction results as it removes noise in the probability maps. The process is depicted in Figure 9. Here, we differentiate different lane instances with different colors.
Figures 10 and 11 depict the qualitative results of different algorithms on TuSimple, CULane and BDD100K. As can be seen in Fig. 10, ENet-SAD can detect lanes more precisely than ENet in TuSimple and CULane. Besides, the detection of ENet-SAD is less affected by irrelevant objects on the road compared with SCNN. As can be seen in Fig. 11, the output probability maps of ENet-SAD are more compact and contain less noise compared with those of SCNN in poor lighting conditions.