End-to-end Lane Detection for Self-Driving Cars (ICCV 2019 Workshop)
Lane detection is typically tackled with a two-step pipeline in which a segmentation mask of the lane markings is predicted first, and a lane line model (like a parabola or spline) is fitted to the post-processed mask next. The problem with such a two-step approach is that the parameters of the network are not optimized for the true task of interest (estimating the lane curvature parameters) but for a proxy task (segmenting the lane markings), resulting in sub-optimal performance. In this work, we propose a method to train a lane detector in an end-to-end manner, directly regressing the lane parameters. The architecture consists of two components: a deep network that predicts a segmentation-like weight map for each lane line, and a differentiable least-squares fitting module that returns for each map the parameters of the best-fitting curve in the weighted least-squares sense. These parameters can subsequently be supervised with a loss function of choice. Our method relies on the observation that it is possible to backpropagate through a least-squares fitting procedure. This leads to an end-to-end method where the features are optimized for the true task of interest: the network implicitly learns to generate features that prevent instabilities during the model fitting step, as opposed to two-step pipelines that need to handle outliers with heuristics. Additionally, the system is not just a black box but offers a degree of interpretability because the intermediately generated segmentation-like weight maps can be inspected and visualized. Code and a video are available at github.com/wvangansbeke/LaneDetection_End2End.
A general trend in deep learning for computer vision is to incorporate prior knowledge about the problem at hand into the network architecture and loss function. Leveraging the large body of fundamental computer vision theory and recycling it into a deep learning framework gives the best of two worlds: the parameter-efficiency of engineered components combined with the power of learned features. The challenge is in reformulating these classical ideas in a manner that they can be integrated into a deep learning framework.
An illustration of this integration is at the intersection of deep learning and scene geometry [Jaderberg et al.(2015)Jaderberg, Simonyan, Zisserman, and Kavukcuoglu, Yan et al.(2016)Yan, Yang, Yumer, Guo, and Lee, Handa et al.(2016)Handa, Bloesch, Pătrăucean, Stent, McCormac, and Davison, Kendall et al.(2017a)Kendall, Cipolla, et al.]: the general idea is to design differentiable modules that give a deep network the capability to apply geometric transformations to the data, and to use geometry-aware criteria as loss functions. The large body of classical research on geometry in computer vision [Faugeras(1993), Hartley and Zisserman(2003), Dorst et al.(2009)Dorst, Fontijne, and Mann] has inspired several methods to incorporate geometric knowledge into the network architecture [Jaderberg et al.(2015)Jaderberg, Simonyan, Zisserman, and Kavukcuoglu, Yan et al.(2016)Yan, Yang, Yumer, Guo, and Lee, Handa et al.(2016)Handa, Bloesch, Pătrăucean, Stent, McCormac, and Davison, Kendall et al.(2017a)Kendall, Cipolla, et al.]. The spatial transformer network (STN) of Jaderberg et al. [Jaderberg et al.(2015)Jaderberg, Simonyan, Zisserman, and Kavukcuoglu] and the perspective transformer net of Yan et al. [Yan et al.(2016)Yan, Yang, Yumer, Guo, and Lee] introduce differentiable modules for the spatial manipulation of data in the network. Handa et al. [Handa et al.(2016)Handa, Bloesch, Pătrăucean, Stent, McCormac, and Davison] extend the STN to 3D transformations. Kendall et al. [Kendall et al.(2015)Kendall, Grimes, and Cipolla, Kendall et al.(2017a)Kendall, Cipolla, et al.] propose a deep learning architecture for 6-DOF camera relocalization and show that instead of naively regressing the camera pose parameters, much higher performance can be reached by taking scene geometry into account and designing a theoretically sound geometric loss function. Other examples of exploiting geometric knowledge as a form of regularization in deep networks include [Boscaini et al.(2016)Boscaini, Masci, Rodolà, and Bronstein, Bronstein et al.(2017)Bronstein, Bruna, LeCun, Szlam, and Vandergheynst, Kendall et al.(2017b)Kendall, Martirosyan, Dasgupta, Henry, Kennedy, Bachrach, and Bry].
In a similar spirit, we propose in this work a lane detection method that exploits prior geometric knowledge about the task, by integrating a least-squares fitting procedure directly into the network architecture. Lane detection is traditionally tackled with a multi-stage pipeline involving separate feature extraction and model fitting steps. First, dense or sparse features are extracted from the image with a method like SIFT or SURF [Lowe(1999), Bay et al.(2006)Bay, Tuytelaars, and Van Gool]. Second, the features are fed as input to an iterative model fitting step such as RANSAC [Fischler and Bolles(1981)] to find the parameters of the best-fitting model, which are the desired outputs of the algorithm. In this work we replace the feature extraction step with a deep network, and we integrate the model fitting step as a differentiable module into the network. The outputs of the network are the lane model parameters, which we supervise with a geometric loss function. The benefit of this end-to-end framework compared to using a multi-stage pipeline of separate steps is that all components are trained jointly: the features can adapt to the task of interest, preventing outliers during the model fitting step. Moreover, our proposed system is not just a black box but offers a degree of interpretability because the intermediately generated weight map is segmentation-like and can be inspected and visualized.
There is a vast literature on lane detection, including many recent methods that employ a CNN for the feature extraction step [Wang et al.(1998)Wang, Shen, and Teoh, McCall and Trivedi(2006), Xu et al.(2009)Xu, Wang, Huang, Wu, and Fang, Lu(2015), Gurghian et al.(2016)Gurghian, Koduri, Bailur, Carey, and Murali]. We refer to [Neven et al.(2018)Neven, De Brabandere, Georgoulis, Proesmans, and Van Gool] for an overview. Discussing all these approaches at length is out of the scope of this work, but most of them have in common that they tackle the task with a multi-stage pipeline involving separate feature extraction and model fitting steps. Our goal in this work is not to outperform these already highly-optimized approaches, but to show that without bells and whistles, lane parameter estimation using our proposed end-to-end method outperforms a multi-step procedure.
Our work is also related to a class of methods which backpropagate through an optimization procedure. The general idea is to include an optimization process within the network itself (in-the-loop optimization). This is possible if the optimization process is differentiable, so that the loss can be backpropagated through it [Rockafellar and Wets(2009)]. One approach is to unroll a gradient descent procedure within the network [Domke(2012), Metz et al.(2017)Metz, Poole, Pfau, and Sohl-Dickstein]. Another approach, proposed by Amos and Kolter [Amos and Kolter(2017)], is to solve a quadratic program (QP) exactly using a differentiable interior point method. Our model fitting module solves a weighted least-squares problem, which is a specific instance of a QP. Our contribution lies in showing the power of differentiable optimization in the form of weighted least-squares fitting on a real-world computer vision task.
We propose an end-to-end trainable framework for lane detection. The framework consists of three main modules, as shown schematically in Figure 1: a deep network which generates weighted pixel coordinates, a differentiable weighted least-squares fitting module, and a geometric loss function. We now discuss each of these components in detail.
Each pixel in an image has a fixed $(x, y)$-coordinate associated with it in the image reference frame. In a reference frame with normalized coordinates, the coordinate of the upper left pixel is $(0, 0)$ and the coordinate of the bottom right pixel is $(1, 1)$. These coordinates can be represented as two fixed feature maps of the same size as the image: one containing the normalized x-coordinates of each pixel and one containing the normalized y-coordinates, indicated by the two white maps in Figure 1.
We can equip each coordinate $(x_i, y_i)$ with a weight $w_i$, predicted by a deep neural network conditioned on the input image. This is achieved by designing the network to generate a feature map with the same spatial dimensions as the input image (and hence also the same spatial dimensions as the x- and y-coordinate maps), representing a weight for each pixel coordinate. Any off-the-shelf dense prediction architecture can be used for this. In order to restrict the weight maps to be non-negative, the output of the network is squared. If we flatten the generated weight map and the two coordinate maps, we obtain a list of $m$ triplets $(x_i, y_i, w_i)$ of respectively the x-coordinate, y-coordinate and coordinate weight of each pixel in the image. If the image has height $h$ and width $w$, the list contains $m = h \cdot w$ triplets. This list is the input to the weighted least-squares fitting module discussed next.
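The construction of the triplet list can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the coordinates are normalized to the range from 0 to 1, the nested list `raw` stands in for the last feature map of a dense prediction network, and the function names are hypothetical.

```python
def coordinate_maps(height, width):
    """Return normalized x- and y-coordinate maps with values in [0, 1].

    Assumes height >= 2 and width >= 2 so that the normalization is defined.
    """
    xs = [[c / (width - 1) for c in range(width)] for _ in range(height)]
    ys = [[r / (height - 1) for _ in range(width)] for r in range(height)]
    return xs, ys


def weighted_triplets(raw_output):
    """Flatten a raw network output map into a list of (x, y, w) triplets.

    raw_output: height x width nested list of unconstrained reals
    (a hypothetical stand-in for the network's last feature map).
    """
    height, width = len(raw_output), len(raw_output[0])
    xs, ys = coordinate_maps(height, width)
    triplets = []
    for r in range(height):
        for c in range(width):
            w = raw_output[r][c] ** 2  # squaring keeps the weight non-negative
            triplets.append((xs[r][c], ys[r][c], w))
    return triplets


raw = [[0.5, -1.0, 0.0],
       [2.0, 0.1, -0.3]]
triplets = weighted_triplets(raw)
print(len(triplets))   # one triplet per pixel of the 2 x 3 map
print(triplets[0])     # upper-left pixel
print(triplets[-1])    # bottom-right pixel
```

For multiple lane lines, the same two coordinate maps would be paired with one weight map per line.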
For the task of lane detection, the network must generate multiple weight maps: one for each lane line that needs to be detected. A lane line (sometimes also referred to as curve) is defined as the line that separates two lanes, usually indicated by lane markings on the road. E.g. in the case of ego-lane detection, the network outputs two weight maps; one for the lane line immediately to the left of the car and one for the lane line immediately to the right, as these lines constitute the borders of the ego-lane.
The fitting module takes the list of $m$ triplets $(x_i, y_i, w_i)$ and interprets them as weighted points in 2D space. Its purpose is to fit a curve (e.g. a parabola, spline or other polynomial curve) through the list of coordinates in the weighted least-squares sense, and to output the parameters $\beta$ of that best-fitting curve.
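Many traditional computer vision methods employ curve fitting as a crucial step in their pipeline. One fundamental and simple fitting procedure is linear least-squares. Consider a system of linear equations
$$X\beta = Y$$
with $X \in \mathbb{R}^{m \times n}$, $\beta \in \mathbb{R}^{n \times 1}$, and $Y \in \mathbb{R}^{m \times 1}$:
$$\begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_n \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$$
There are $m$ equations in $n$ unknowns. If $m > n$, the system is overdetermined and no exact solution exists. We resort to finding the least-squares solution, which is defined as the solution which minimizes the sum of squared differences between the data values and their corresponding modeled values:
$$\beta = \arg\min_{\beta} \lVert X\beta - Y \rVert^2$$
The solution is found by solving the normal equations, and involves the matrix multiplication of $Y$ with the pseudo-inverse of $X$:
$$\beta = (X^\top X)^{-1} X^\top Y$$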
We can extend the previous formulation to a weighted least-squares problem. Let $W \in \mathbb{R}^{m \times m}$ be a diagonal matrix containing weights $w_i$ for each observation. In our framework, the observations will correspond to the fixed $(x, y)$-coordinates in the image reference frame, and the weights will be generated by a deep network conditioned on the image. The weighted least-squares problem is
$$WX\beta = WY$$
By defining $X' = WX$ and $Y' = WY$, with $W = \mathrm{diag}(w_1, \dots, w_m)$, we can reformulate this problem in the standard form $X'\beta = Y'$ and solve it in the same way as before:
$$\beta = (X'^\top X')^{-1} X'^\top Y'$$
Recall from the previous section that we have a list of weighted pixel coordinates $(x_i, y_i, w_i)$ where the coordinates $(x_i, y_i)$ are fixed and the weights $w_i$ are generated by a deep network conditioned on an input image. We can use these values to construct the matrices $X$, $Y$ and $W$, solve the weighted least-squares problem, and obtain the parameters $\beta$ of the best-fitting curve through the weighted pixel coordinates.
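As a minimal sketch of this fitting step, the following plain-Python code fits a polynomial $y = \beta_0 + \beta_1 x + \beta_2 x^2$ to a list of triplets by scaling each row by its weight and solving the normal equations. The helper names, the choice of regressing $y$ on $x$, and the small Gaussian elimination routine are assumptions made for illustration; a practical implementation would rely on a linear algebra library.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def weighted_polyfit(triplets, degree=2):
    """Weighted least-squares fit of y = beta_0 + beta_1 x + ... + beta_d x^d.

    Each row of X and entry of Y is scaled by its weight (X' = W X, Y' = W Y),
    then the normal equations X'^T X' beta = X'^T Y' are solved.
    """
    n = degree + 1
    xtx = [[0.0] * n for _ in range(n)]
    xty = [0.0] * n
    for x, y, w in triplets:
        row = [w * x ** p for p in range(n)]  # one row of X' = W X
        for i in range(n):
            xty[i] += row[i] * (w * y)
            for j in range(n):
                xtx[i][j] += row[i] * row[j]
    return solve_linear(xtx, xty)


# Points on y = 1 + 2x - 3x^2 with weight 1, plus an outlier with weight 0.
pts = [(x, 1.0 + 2.0 * x - 3.0 * x * x, 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
pts.append((0.5, 10.0, 0.0))
beta = weighted_polyfit(pts)
print(beta)  # close to [1.0, 2.0, -3.0]; the zero-weight outlier is ignored
```

The example illustrates why the weights matter: a point with zero weight has no influence on the fit, so a network that assigns low weights to non-lane pixels yields a clean one-shot fit.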
The contribution of this work lies in the following insight: instead of treating the fitting procedure as a separate post-processing step, we can backpropagate through it and apply a loss function on the parameters of interest rather than indirectly on the weight maps produced by the network. This way, we obtain a powerful tool for tackling lane detection within a deep learning framework in an end-to-end manner.
Note that the weighting step and the closed-form least-squares solution only involve differentiable matrix operations. It is thus possible to calculate derivatives of $\beta$ with respect to $W$, and consequently also with respect to the parameters of the deep network. The specifics of backpropagating through matrix transformations are well understood. We refer to [Giles(2008)] for the derivation of the gradients of this problem using Cholesky decomposition. Efficient implementations are available in TensorFlow [Abadi et al.(2015)Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado, Davis, Dean, Devin, Ghemawat, Goodfellow, Harp, Irving, Isard, Jia, Jozefowicz, Kaiser, Kudlur, Levenberg, Mané, Monga, Moore, Murray, Olah, Schuster, Shlens, Steiner, Sutskever, Talwar, Tucker, Vanhoucke, Vasudevan, Viégas, Vinyals, Warden, Wattenberg, Wicke, Yu, and Zheng] and PyTorch [Paszke et al.(2017)Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga, and Lerer].
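To make the differentiability concrete, the derivative of the fitted parameters with respect to a single weight can be written in closed form. The following is a sketch under the assumption that $W = \mathrm{diag}(w_1, \dots, w_m)$, $X' = WX$ and $Y' = WY$; it is not taken from the cited reference, which treats the general matrix case. Here $\mathbf{x}_i^\top$ denotes the $i$-th row of $X$ and $Y_i$ the $i$-th entry of $Y$.

```latex
% Assumption: A = X^T W^2 X and b = X^T W^2 Y, so the fit satisfies A \beta = b.
\begin{align}
\frac{\partial A}{\partial w_i}\,\beta + A\,\frac{\partial \beta}{\partial w_i}
  &= \frac{\partial b}{\partial w_i},
\qquad
\frac{\partial A}{\partial w_i} = 2 w_i\, \mathbf{x}_i \mathbf{x}_i^\top,
\quad
\frac{\partial b}{\partial w_i} = 2 w_i\, \mathbf{x}_i Y_i, \\
\frac{\partial \beta}{\partial w_i}
  &= 2 w_i \left( Y_i - \mathbf{x}_i^\top \beta \right) A^{-1} \mathbf{x}_i .
\end{align}
```

The gradient of a weight is thus proportional to the weight itself and to the residual of its point with respect to the current fit. In practice, automatic differentiation frameworks compute this without a hand-written derivation.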
By backpropagating the loss through the weighted least-squares problem, the deep network can learn to generate a weight map that gives accurate lane line parameters when fitted with a least-squares procedure, rather than optimizing the weight map for a proxy objective like lane line segmentation with curve fitting as a separate post-processing step.
The model parameters that are the output of the curve fitting step can be supervised directly with a mean squared error criterion or via a more principled geometric loss function as the one discussed in the next section.
The curve parameters $\beta_i$ that are the output of the curve fitting step could be supervised by comparing them with the ground truth parameters $\hat{\beta}_i$ with a mean squared error criterion, leading to the following L2 loss:
$$L = \frac{1}{n} \sum_{i} (\beta_i - \hat{\beta}_i)^2$$
The problem with this is that the curve parameters have different sensitivities: a small error in one parameter value might have a larger effect on the curve shape than an error of the same magnitude in another parameter.
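Ultimately, the design of a loss function depends on the task: it must optimize a relevant metric for the task of interest. For lane detection, we opt for a loss function that has a geometric interpretation: it minimizes the squared area between the predicted curve $y_\beta(x)$ and the ground truth curve $y_{\hat{\beta}}(x)$ in the image plane, up to a point $t$ (see Figure 2):
$$L = \int_0^t \left( y_\beta(x) - y_{\hat{\beta}}(x) \right)^2 dx$$
For a straight line $y = \beta_0 + \beta_1 x$, this results in:
$$L = \frac{\Delta\beta_1^2}{3} t^3 + \Delta\beta_0 \Delta\beta_1 t^2 + \Delta\beta_0^2 t$$
where $\Delta\beta_i = \beta_i - \hat{\beta}_i$. For a parabolic curve $y = \beta_0 + \beta_1 x + \beta_2 x^2$ it gives:
$$L = \frac{\Delta\beta_2^2}{5} t^5 + \frac{2 \Delta\beta_2 \Delta\beta_1}{4} t^4 + \frac{\Delta\beta_1^2 + 2 \Delta\beta_2 \Delta\beta_0}{3} t^3 + \frac{2 \Delta\beta_1 \Delta\beta_0}{2} t^2 + \Delta\beta_0^2 t$$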
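The closed-form expressions for the line and the parabola are special cases of one polynomial identity, which the sketch below evaluates for any degree and checks against a numerical integral. It assumes coefficients are ordered by increasing power ($\beta_0$ first); the function names are illustrative and not taken from the released code.

```python
def geometric_loss(beta, beta_gt, t):
    """Integral over [0, t] of the squared difference of two polynomials.

    beta, beta_gt: coefficients in increasing order (beta_0 + beta_1 x + ...).
    """
    d = [b - g for b, g in zip(beta, beta_gt)]
    loss = 0.0
    for i, di in enumerate(d):
        for j, dj in enumerate(d):
            # The integral of x^(i+j) from 0 to t is t^(i+j+1) / (i+j+1).
            loss += di * dj * t ** (i + j + 1) / (i + j + 1)
    return loss


def numeric_loss(beta, beta_gt, t, steps=10000):
    """Midpoint-rule approximation of the same integral, as a sanity check."""
    h = t / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        diff = sum((b - g) * x ** p for p, (b, g) in enumerate(zip(beta, beta_gt)))
        total += diff * diff * h
    return total


pred, gt = [0.2, 1.0, -0.4], [0.0, 0.5, 0.1]
print(geometric_loss(pred, gt, 0.7))
print(numeric_loss(pred, gt, 0.7))  # agrees with the closed form up to discretization error
```

Unlike the plain L2 loss on the parameters, this loss weights each parameter error by its effect on the curve over the interval of interest.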
Before feeding the list of weighted coordinates to the fitting module, the coordinates can optionally be transformed to another reference frame by multiplying them with a transformation matrix $H$. This multiplication is also a differentiable operation through which backpropagation is possible.
In lane detection, for example, a lane line is better approximated as a parabola in the ortho-view (i.e., the top-down view) than as a parabola in the original image reference frame. We can achieve this by simply transforming the coordinates from the image reference frame to the ortho-view by multiplying them with a homography matrix $H$. The homography matrix is considered known in this case. Note that it is not the input image that is transformed to the ortho-view, only the list of coordinates.
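A coordinate transformation of this kind can be sketched as follows. The matrix values below are made up purely for illustration and do not correspond to any real camera calibration; only the coordinates are mapped, and the weights attached to them stay unchanged.

```python
def apply_homography(h, points):
    """Map 2D points through a 3x3 homography using homogeneous coordinates."""
    out = []
    for x, y in points:
        u = h[0][0] * x + h[0][1] * y + h[0][2]
        v = h[1][0] * x + h[1][1] * y + h[1][2]
        s = h[2][0] * x + h[2][1] * y + h[2][2]
        out.append((u / s, v / s))  # divide by the homogeneous scale factor
    return out


# Hypothetical homography, chosen only so that the scale stays positive
# for normalized coordinates in [0, 1].
H = [[1.0, -0.5, 0.25],
     [0.0, 1.0, 0.0],
     [0.0, -0.8, 1.0]]

image_points = [(0.5, 0.0), (0.5, 0.5), (0.0, 1.0)]
ortho_points = apply_homography(H, image_points)
print(ortho_points)
```

Every step is a multiplication, addition or division, so the mapping is differentiable wherever the scale factor is non-zero.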
The proposed method is general in nature. To better understand the dynamics of backpropagating through a least-squares fitting procedure, we first provide a simple toy experiment. Next, we evaluate our method on the real-world task of lane detection. Our goal is not to extensively tune the method and equip it with bells and whistles to reach state-of-the-art performance on a lane detection benchmark, but to illustrate that end-to-end training with our method outperforms the classical two-step procedure in a fair comparison.
Recall that the least-squares fitting module takes a list of weighted coordinates as input and produces the parameters of the best-fitting curve (e.g. a polynomial) through these points as output. During training, the predicted curve is compared to the ground truth curve (e.g. an annotated lane) in the loss function, and the loss is backpropagated to the inputs of the module. Note that we only discussed backpropagating the loss to the coordinate weights, generated by a deep network, but that it is in principle also possible to backpropagate the loss to the coordinates themselves.
This is illustrated in Figure 3. The blue dots represent coordinates and their size represents their weight. The blue line is the best-fitting straight line (i.e., first-order polynomial) through the blue dots, in the weighted least-squares sense. The green line represents the ground truth line, and is the target. If we design the loss as in Section 2.3 and backpropagate through the fitting module, we can iteratively minimize the loss through gradient descent such that the predicted line converges towards the target line. This can happen in three ways:
By updating the coordinates while keeping their weights fixed. This corresponds to the first row in the figure, where the blue dots move around but keep the same size.
By updating the weights while keeping the coordinates fixed. This corresponds to the second row in the figure, where the blue dots change size but stay at the same location.
By updating both the coordinates and the weights. This corresponds to the third row in the figure, where the blue dots move around and change size at the same time.
For the lane detection task in the next section we focus on the second case, where the coordinate locations are fixed, as they represent image pixel coordinates that lie on a regular grid. The loss is thus only backpropagated to the coordinate weights, and from there further into the network that generates them, conditioned on an input image.
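The second case (fixed coordinates, learned weights) can be reproduced with a small self-contained sketch. It assumes a straight-line model $y = \beta_0 + \beta_1 x$, the squared-area loss on the interval from 0 to 1, and a hand-derived gradient $\partial L / \partial w_i = 2 w_i r_i \, (A^{-1} g)^\top [1, x_i]$, where $r_i$ is the residual of point $i$, $A = X^\top W^2 X$ and $g = \partial L / \partial \beta$. The point set, learning rate and function names are illustrative choices, not the settings of the original experiment.

```python
def fit_line(points, weights):
    """Weighted least-squares line fit; returns beta and the inverse of A."""
    s0 = sum(w * w for w in weights)
    s1 = sum(w * w * x for (x, _), w in zip(points, weights))
    s2 = sum(w * w * x * x for (x, _), w in zip(points, weights))
    b0 = sum(w * w * y for (_, y), w in zip(points, weights))
    b1 = sum(w * w * x * y for (x, y), w in zip(points, weights))
    det = s0 * s2 - s1 * s1
    a_inv = [[s2 / det, -s1 / det], [-s1 / det, s0 / det]]
    beta = [a_inv[0][0] * b0 + a_inv[0][1] * b1,
            a_inv[1][0] * b0 + a_inv[1][1] * b1]
    return beta, a_inv


def loss_and_grad(beta, target, t=1.0):
    """Squared-area loss between two lines on [0, t] and its gradient in beta."""
    d0, d1 = beta[0] - target[0], beta[1] - target[1]
    loss = d1 * d1 * t ** 3 / 3 + d0 * d1 * t ** 2 + d0 * d0 * t
    grad = [d1 * t ** 2 + 2 * d0 * t, 2 * d1 * t ** 3 / 3 + d0 * t ** 2]
    return loss, grad


def train_weights(points, weights, target, lr=0.5, steps=300):
    """Gradient descent on the weights only; the coordinates stay fixed."""
    weights = list(weights)
    history = []
    for _ in range(steps):
        beta, a_inv = fit_line(points, weights)
        loss, g = loss_and_grad(beta, target)
        history.append(loss)
        # v = A^{-1} g, so that dL/dw_i = 2 w_i r_i (v . [1, x_i])
        v = [a_inv[0][0] * g[0] + a_inv[0][1] * g[1],
             a_inv[1][0] * g[0] + a_inv[1][1] * g[1]]
        updated = []
        for (x, y), w in zip(points, weights):
            residual = y - beta[0] - beta[1] * x
            updated.append(w - lr * 2 * w * residual * (v[0] + v[1] * x))
        weights = updated
    return weights, history


target = [0.2, 0.5]  # ground truth line y = 0.2 + 0.5 x
# Three points on the target line and two outliers (second and fourth).
points = [(0.0, 0.2), (0.25, 0.9), (0.5, 0.45), (0.75, 0.1), (1.0, 0.7)]
weights, history = train_weights(points, [1.0] * len(points), target)
print(history[0], history[-1])  # the loss at the end is lower than at the start
print(weights)
```

In the lane detection setting the weights are not free variables as in this sketch but outputs of the network, so the same gradient is propagated further into the network parameters.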
We now turn to the real-world task of ego-lane detection. To be more precise, the task is to predict the parameters of the two border lines of the ego-lane (i.e. the lane the car is driving in) in an image recorded from the front-facing camera in a car. As discussed before, the traditional way of tackling this task is with a two-step pipeline in which features are detected first, and a lane line model is fitted to these features second. The lane line model is typically a polynomial or spline curve. In this experiment, the lines are modeled as parabolic curves in the ortho-view. The network must predict the parameters $\beta_0$, $\beta_1$ and $\beta_2$ of each curve from the untransformed input image. The error is measured as the normalized area between the predicted curve and the ground truth curve in the ortho-view up to a fixed distance $t$ from the car:
$$\text{error} = \frac{1}{t} \int_0^t \left| y_\beta(x) - y_{\hat{\beta}}(x) \right| dx$$
This error is averaged over the two lane lines of the ego-lane and over the images in the dataset.
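A possible way to evaluate such an error numerically is sketched below. It assumes that the normalized area is the integral of the absolute difference between the two curves divided by the distance $t$, and that the coefficients are ordered by increasing power; both the interpretation and the function name are assumptions made for illustration.

```python
def normalized_area_error(beta, beta_gt, t, steps=2000):
    """Approximate (1/t) * integral over [0, t] of |y_beta(x) - y_gt(x)| dx.

    Uses the midpoint rule; beta and beta_gt hold polynomial coefficients
    in increasing order of power.
    """
    h = t / steps
    area = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        diff = sum((b - g) * x ** p for p, (b, g) in enumerate(zip(beta, beta_gt)))
        area += abs(diff) * h
    return area / t


# Average over the left and right lane line of one image (made-up parameters).
left = normalized_area_error([0.10, 0.5, 0.02], [0.12, 0.5, 0.02], t=1.0)
right = normalized_area_error([0.60, 0.4, 0.00], [0.60, 0.4, 0.00], t=1.0)
print((left + right) / 2)
```

Because of the absolute value, deviations on opposite sides of the ground truth curve do not cancel out.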
Again, it is not our goal to outperform the sophisticated and highly-tuned lane detection frameworks found in the literature, but rather to provide an apples-to-apples comparison of our proposed end-to-end method to a classical two-step pipeline. Extensions like data augmentation, more realistic lane line models, and an optimized base network architecture are orthogonal to our approach. In order to provide a fair comparison, we train the same network in two different ways, and measure its performance according to the error metric. We compare the following two methods:
Cross-entropy training. In this setting, the network that generates the pixel coordinate weights (two weight maps: one for each lane line) is trained in a segmentation-like manner, with the standard per-pixel binary cross-entropy loss. This corresponds to the feature detection step in a two-step pipeline. The segmentation labels are created from the ground truth curve parameters by simply drawing the corresponding curve with a fixed thickness as a dense label. At test time, a parabola is fitted through the predicted features in the least-squares sense. This corresponds to the fitting step in a two-step pipeline.
End-to-end training. In this setting, the network is trained with our proposed method involving backpropagation through a weighted least-squares fit and the geometric loss function. There is no need to create any proxy segmentation labels, as the supervision is directly on the curve parameters. It corresponds to the second case in Section 3.1.
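For the cross-entropy baseline, the proxy labels and the per-pixel loss could look roughly like the sketch below. It makes simplifying assumptions that are not in the text: the curve gives the column index as a polynomial of the row index in pixel units, the ortho-view transformation is ignored, and the thickness is expressed as a half-width in pixels.

```python
import math


def draw_label(beta, height, width, half_thickness=1.0):
    """Binary mask with ones within half_thickness pixels of the curve.

    The curve gives the column as a polynomial in the row index
    (an assumption made for this sketch).
    """
    mask = [[0] * width for _ in range(height)]
    for r in range(height):
        c_curve = sum(b * r ** p for p, b in enumerate(beta))
        for c in range(width):
            if abs(c - c_curve) <= half_thickness:
                mask[r][c] = 1
    return mask


def binary_cross_entropy(probs, mask, eps=1e-7):
    """Mean per-pixel binary cross-entropy between probabilities and a mask."""
    total, count = 0.0, 0
    for p_row, m_row in zip(probs, mask):
        for p, m in zip(p_row, m_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= m * math.log(p) + (1 - m) * math.log(1.0 - p)
            count += 1
    return total / count


label = draw_label([1.0, 0.5], height=4, width=6, half_thickness=0.5)
for row in label:
    print(row)

uniform = [[0.5] * 6 for _ in range(4)]
print(binary_cross_entropy(uniform, label))  # log(2) for an uninformative prediction
```

This makes the contrast with end-to-end training explicit: here every pixel receives a target, whereas the end-to-end loss only constrains the fitted curve.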
We run our experiment on the TuSimple lane detection dataset [TuSimple(2017)]. We manually select and clean up the annotations of 2535 images of the dataset, filtering out images where the ego-lane cannot be detected unambiguously (e.g. when the car is switching lanes). 20% of the images are held out for validation, taking care not to include images of a single temporal sequence in both training and validation set.
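Keeping temporal sequences intact when holding out validation data can be done by splitting on sequence identifiers rather than on individual images. The sketch below assumes each sample is a pair of a sequence identifier and an image path; this data layout and the file names are hypothetical and do not reflect the actual dataset structure. Note that it holds out a fraction of the sequences, which only approximates a fraction of the images when sequences differ in length.

```python
import random


def split_by_sequence(samples, val_fraction=0.2, seed=0):
    """Split (sequence_id, image_path) pairs so no sequence spans both sets."""
    sequences = sorted({seq for seq, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(sequences)
    n_val = max(1, round(val_fraction * len(sequences)))
    val_sequences = set(sequences[:n_val])
    train = [s for s in samples if s[0] not in val_sequences]
    val = [s for s in samples if s[0] in val_sequences]
    return train, val


# Hypothetical data: 10 sequences of 5 frames each.
samples = [(f"seq{i // 5}", f"seq{i // 5}/frame{i % 5}.jpg") for i in range(50)]
train, val = split_by_sequence(samples)
print(len(train), len(val))
```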
is used as the network architecture. The last layer is adapted to output two feature maps, one for each ego-lane line. In both the cross-entropy and end-to-end experiments, we train for 350 epochs on a single GPU with an image resolution of 256x512, a batch size of 8, and the Adam [Kingma and Ba(2015)] optimizer with a learning rate of 1e-4. As a simple data augmentation technique the images are randomly flipped horizontally. In the end-to-end experiments, we use a fixed transformation matrix H to transform the weighted pixel coordinates to the ortho-view. Note that the input image itself is not transformed to the ortho-view, although that would also be an option. The system is implemented in PyTorch [Paszke et al.(2017)Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga, and Lerer].
Figure 4 shows the error (i.e., the normalized area between ground truth and predicted curve) during training for both methods, on the training set (left) and on the validation set (right). Table 1 summarizes these results and also reports the value of the geometric loss (see Section 2.3), which is the actual metric being optimized in our end-to-end method.
We see that our end-to-end method converges to lower error than the method trained with cross-entropy loss, both on the training and validation set. The convergence is slower, but this should come as no surprise: the supervision signal in the end-to-end method is much weaker than in the cross-entropy method with dense per-pixel labels. To see this, consider that the end-to-end method does not explicitly force the weight map to be a segmentation of the actual lane lines in the image. Even if the network generated a seemingly random-looking weight map, the loss (and thus the gradients) would still be zero, as long as the least-squares fit through the weighted coordinates would coincidentally correspond to the ground truth curve. For example, the network could fall into a local minimum of generating the weight map based on image features such as the vanishing point at the horizon and the left corner of the image, still resulting in a relatively well fitting curve but hard to improve upon. One option to combine the fast convergence of the cross-entropy method with the superior performance of the end-to-end method would be to pre-train the network with the first and fine-tune with the latter.
Figure 5 shows some qualitative results of our method. Despite the weak supervision signal, the network seems to eventually discover that the most consistent way to satisfy the loss function is to focus on the visible lane markings in the image, and to map them to a segmentation-like representation in the weight maps. The network learns to handle the large variance in lane markings and can tackle challenging conditions like the faded markings in the bottom example of Figure 5.
To be robust against outliers in the fitting step, classic methods often resort to iterative optimization procedures like RANSAC [Fischler and Bolles(1981)], iteratively reweighted least-squares [Holland and Welsch(1977)] and iterative closest point [Besl and McKay(1992)]. In our end-to-end framework, the network learns a mapping from input image to weight map such that the fitting step becomes robust. This moves complexity from the post-processing step into the network, allowing for a simple one-shot fitting step.
In this work we proposed a method for estimating lane curvature parameters by solving a weighted least-squares problem in-network, where the weights are generated by a deep network conditioned on the input image. The network was trained to minimize the area between the predicted lane lines and the ground truth lane lines, using a geometric loss function. We visualized the dynamics of backpropagating through a weighted least-squares fitting procedure, and provided an experiment on a real-world lane detection task showing that our end-to-end method outperforms a two-step procedure despite the weaker supervision signal. The general idea of backpropagating through an in-network optimization step could prove effective in other computer vision tasks as well, for example in the framework of active contour models. In such a setting, the least-squares fitting module could perhaps be replaced with a more versatile differentiable gradient descent module. This will be explored in future work.
Acknowledgement: This work was supported by Toyota, and was partially carried out at the TRACE Lab at KU Leuven (Toyota Research on Automated Cars in Europe).
Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
Boscaini, Masci, Rodolà, and Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems, 2016.
Domke. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 318–326, 2012.