Recent advances in self-driving vehicles have been impressive. Drivable area extraction is a key technology in this domain and a prerequisite for safe and reliable autonomous driving. Currently, the more mature techniques have mainly been designed for structured urban road environments, but few studies have focused on off-road environments. In off-road environments, there are no structured features such as traffic lanes, paved road surfaces or guardrails. The off-road drivable area usually has ambiguous margins, various textures and complex features, which creates considerable challenges in extracting the drivable area. As a result, algorithms designed for the urban environment are difficult to apply directly to the off-road environment.
Cameras and LiDAR are the two main sensors that provide input data for drivable area extraction tasks. Many camera-based methods have already been applied in off-road environments. However, the color or texture features they use are not robust enough under diverse illumination and weather conditions, and the lack of 3D information further limits their performance and adaptability across different scenes. LiDAR has been widely used in self-driving systems because it directly collects 3D point cloud data. Some LiDAR-based methods depend on data segmentation and rule/threshold-based processing to extract the drivable area. However, these methods rely heavily on human-designed features and presupposed thresholds, and they usually adapt poorly to new scenes.
In this work, we focus on off-road drivable area extraction using 3D LiDAR data. To illustrate the main challenges of the off-road scene, we use light blue polygons to enclose some typical ambiguous areas in column (b) of Figure.1. A human driver would not enjoy driving in these areas because of their higher traversability costs, but they are technically drivable to some degree. It is unreasonable to simply label these ambiguous areas as either drivable zones or obstacle zones, so we call them grey zones in this paper.
We propose a deep learning method for drivable area extraction using 3D LiDAR data specific to the off-road environment. Compared with traditional human-designed features, the proposed method can autonomously learn features of the drivable zone from the labelled data. Additionally, it is suitable for weakly supervised and semi-supervised learning. By combining the features from the vehicle paths and auto-generated vertical obstacles, our method can significantly decrease the demand for human annotation in the neural network training. The experimental results prove the validity of our proposed method.
II Related Works
Cameras are one of the most important sensors in road/drivable area extraction tasks for autonomous vehicles. Some camera-based methods depend on the assumption of global road priors such as road boundaries, traffic lanes or vanishing points. Other studies do not rely on these assumptions but view drivable area extraction as a segmentation of road and non-road regions. Furthermore, some stereo-camera-based approaches make use of depth information to help off-road drivable area extraction. Despite achieving good performance, camera-based methods are easily affected by changing illumination. LiDAR-based methods can address this weakness, and the higher-precision 3D information can be conveniently used to extract road boundaries or fit the road plane. Given the complementary characteristics of the two sensors, LiDAR-camera fusion becomes a natural solution. For example, Dahlkamp et al. identified a nearby drivable area by LiDAR and used it to train an image-based classifier for far-range drivable area detection.
The existing drivable area extraction methods are mostly designed for urban environments, but the problem in off-road environments is quite different. One fundamental problem is the ambiguous definition of the drivable area, as shown in Figure.1. Many studies have proposed similar concepts for off-road scenes from different perspectives, such as traversability analysis and drivable corridors. Although some methods have already been implemented in off-road environments, they still have limitations. These methods usually focus on the mechanically drivable area but seldom distinguish whether these areas are likely to be chosen by a human driver, a distinction that matters greatly for autonomous vehicles.
Recently, many deep learning methods have achieved impressive results on related tasks. Compared to traditional methods, deep networks can learn high-level semantic features directly from the data, which usually perform better than human-designed features. However, deep learning methods usually rely on large human-annotated datasets such as KITTI and Cityscapes. For off-road environments, there are few widely used datasets for drivable area extraction due to the ambiguous problem definition. To reduce the demand for human annotation, some studies have used a simulator to generate unlimited training data. In addition, other studies have focused on weakly supervised or semi-supervised methods, attempting to substitute auto-generated weak labels, which can be obtained cheaply.
This work focuses on drivable area extraction in off-road environments. We propose a LiDAR-based deep learning framework specific to the ambiguities in this task. To reduce the demand for human-annotated datasets, we also propose weakly supervised and semi-supervised methods to learn features from auto-generated labels.
III-A Problem Definition
We aggregate the point clouds of several consecutive frames to obtain a dense bird's-eye-view height map, which is used as our input data format. Each pixel of the height map stores the physical height at the corresponding location, and the car is located at the center of the map with an upward heading. Input examples can be seen in Figure.1(a).
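As a concrete illustration, the projection from an ego-centered point cloud to a bird's-eye-view height map can be sketched as below; the grid size, the max-height-per-cell rule and the helper name are illustrative assumptions, not the paper's exact parameters (only the 0.2 m pixel size is taken from Section V).

```python
import numpy as np

def build_height_map(points, grid_size=200, resolution=0.2):
    """Project an ego-centered point cloud (N x 3 array, metres) into a
    bird's-eye-view height map; the car sits at the map centre with an
    upward heading. NaN marks cells without any LiDAR observation."""
    height_map = np.full((grid_size, grid_size), np.nan)
    half = grid_size // 2
    # Convert metric x/y coordinates to grid indices around the centre cell.
    cols = np.floor(points[:, 0] / resolution).astype(int) + half
    rows = half - np.floor(points[:, 1] / resolution).astype(int)
    valid = (rows >= 0) & (rows < grid_size) & (cols >= 0) & (cols < grid_size)
    for r, c, z in zip(rows[valid], cols[valid], points[valid, 2]):
        # Keep the maximum height observed in each cell.
        if np.isnan(height_map[r, c]) or z > height_map[r, c]:
            height_map[r, c] = z
    return height_map
```

Aggregating several frames simply means concatenating their ego-aligned point clouds before this projection.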
Different from the well-defined road borders in structured urban environments, the main peculiarity of off-road environments is the ambiguous area beside the road margin, which is called the grey zone. To distinguish it from the others, we introduce a grey label alongside the drivable label dri and the obstacle label obs, and we use G to denote the human-annotated ground truth, in which every pixel takes one of these three labels.
The original output of our proposed framework is a cost map in which each value evaluates the traversability cost of the corresponding pixel, so the framework learns a mapping from the input height map to the cost map. For convenient comparison with the human-annotated ground truth and other baseline methods, we use the discretization rule of Equation (4) to obtain the predicted label Y. Therefore, the problem in this work can be formulated as learning a multi-class classifier that maps the input height map to a 3-class label Y.
III-B Network Architecture
Due to the ambiguity of the grey zone, we hold the view that classifying it as a third independent label is not reasonable. In some cases, the grey zone is technically drivable but not human-desired, so it lies very close to, or even overlaps with, the drivable zone in the feature space. In other cases, the grey zone may have a higher traversability cost than the common drivable zone, so it lies closer to, or even overlaps with, the obstacle zone in the feature space. As a result, treating the ambiguous grey zone as an independent label in the training process will confuse the deep learning model, and the experimental results in Section.V give evidence of this viewpoint.
Therefore, we can assume that the grey zone samples are distributed between the drivable samples and the obstacle samples in the feature space. The key idea of our proposed method is learning two classification surfaces to separate the grey zone samples. One classification surface is used to separate the drivable zone samples from the other samples. The other classification surface is used to separate the obstacle zone samples. As a result, we can evaluate the traversability cost of a sample by its feature distance to the two surfaces.
As shown in Figure.2, the proposed network has two branches for learning the two classification surfaces mentioned above. Both branches are designed according to a common VGG-based fully convolutional network. The difference is that the last layer does not output discrete labels but the probabilistic predictions from the last softmax layer. We denote these outputs as $P_{dri}$ and $P_{obs}$, both in the range of $[0, 1]$.
Each branch is trained end-to-end, guided by the following cross-entropy loss function:

$$L_b = -\sum_{i}\sum_{c} \hat{y}_{i,c}\,\log P_b(c \mid x_i; \theta),$$

where $b \in \{dri, obs\}$ is the name of the network's branch, and $P_b(c \mid x_i; \theta)$ is the probability that pixel $x_i$ is predicted as label $c$ with the network parameters $\theta$. We use $P_{dri}$ and $P_{obs}$ to represent the probabilistic outputs of the drivable and obstacle branches.
When training the network, different labels are used in the two branches. Here, $\hat{y}_i$ is the one-hot vector of the label at pixel $i$. Concretely, when training the drivable branch, we relabel all grey-zone pixels as non-drivable to obtain its binary labels; when training the obstacle branch, we relabel all grey-zone pixels as non-obstacle to obtain its binary labels.
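For illustration, the branch-specific relabelling of grey-zone pixels can be sketched as follows; the integer label encoding is a hypothetical choice, not the paper's.

```python
import numpy as np

DRI, GREY, OBS = 0, 1, 2  # illustrative label encoding

def branch_labels(gt, branch):
    """Derive the binary training labels for one branch from the 3-class
    ground truth: grey-zone pixels are merged into the 'negative' side,
    so each branch only learns a single classification surface."""
    labels = gt.copy()
    if branch == "dri":
        labels[labels == GREY] = OBS   # grey counts as non-drivable
    elif branch == "obs":
        labels[labels == GREY] = DRI   # grey counts as non-obstacle
    return labels
```

Each branch therefore sees a clean two-class problem, which is the mechanism that keeps the ambiguous grey samples from confusing either classifier.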
We compute the traversability cost map from $P_{dri}$ and $P_{obs}$ and discretize it with the following rule: pixels whose drivable probability satisfies $P_{dri} > \tau_{dri}$ are labelled as dri, pixels whose obstacle probability satisfies $P_{obs} > \tau_{obs}$ are labelled as obs, and the remaining pixels are labelled as grey, where $\tau_{dri}$ and $\tau_{obs}$ are hyper-parameters.
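A minimal sketch of this discretization, assuming the obstacle condition takes precedence when both thresholds are exceeded (that tie-breaking order is our assumption):

```python
import numpy as np

def discretize(p_dri, p_obs, tau_dri=0.5, tau_obs=0.5):
    """Turn the two branch probabilities into 3 discrete labels:
    0 = drivable, 1 = grey, 2 = obstacle (encoding is illustrative).
    The 0.5 defaults echo the SUV thresholds in Section IV-B."""
    labels = np.ones_like(p_dri, dtype=int)  # default: grey zone
    labels[p_dri > tau_dri] = 0              # confidently drivable
    labels[p_obs > tau_obs] = 2              # confidently obstacle
    return labels
```

Raising either threshold shrinks the corresponding confident zone and widens the grey zone, which is how the rule adapts to vehicles with different traversing capability.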
III-C Weakly and Semi-supervised Learning
To reduce the demand for high-cost human-annotated data, we propose weakly and semi-supervised learning methods, shown in Figure.3. In the weakly supervised method, no human-annotated labels are used for training. In the semi-supervised method, the framework combines a large number of auto-generated labels (see Section.IV-A for more details) with only a small fraction of human-annotated labels to train the network. Apart from the numerous auto-generated labels, this framework is almost the same as the fully supervised one. Our proposed network receives the LiDAR-based height maps as inputs and outputs the traversability cost of each pixel. For the convenience of evaluation, the cost map is discretized into the 3-class result and projected onto the image for visualization.
For human-annotated data, we use the loss function in Equation (2) for training. For data with only auto-generated weak labels, we define the loss in branch $b$ with an additional regularization weight $\lambda$. This weak-label loss has a similar definition to Equation (2), except that pixels without auto-generated labels are assigned an unknown label and excluded from the loss. Pixels without any LiDAR observation are also labelled as unknown.
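The idea of excluding unknown pixels from the weak-label loss can be sketched with a masked binary cross-entropy (a numpy stand-in for the per-branch loss; the λ-weighted combination with the supervised loss is omitted here):

```python
import numpy as np

UNKNOWN = -1  # pixels with no weak label / no LiDAR observation

def masked_cross_entropy(probs, labels):
    """Binary cross-entropy for one branch, averaged only over pixels
    that carry an auto-generated weak label; unknown pixels contribute
    nothing to the loss (and hence nothing to the gradient)."""
    mask = labels != UNKNOWN
    p = np.clip(probs[mask], 1e-7, 1 - 1e-7)   # numerical safety
    y = labels[mask]
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))
```

This masking is what lets sparse weak labels (vehicle paths, vertical obstacles) train the network without forcing a decision on the unlabelled in-between pixels.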
Table I. Evaluation measures for the drivable zone and the obstacle zone. Notation: dri: drivable zone; obs: obstacle zone; G: ground truth; Y: predicted label; VP: vehicle path; |X|: number of pixels in X.
IV Implementation Details
IV-A Automatic Labelling
As illustrated in Figure.3, we use the recorded data from the data collection car to obtain the auto-generated labels. We follow the rule-based region growing method described in Algorithm1 to extract vertical obstacles from the LiDAR data as weak obstacle zone labels. Only a loose threshold needs to be set for region growing, and it yields a relatively strict vertical obstacle area.
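As a simplified stand-in for Algorithm 1 (whose exact rules are not reproduced in this excerpt), a loose step-height rule over neighbouring grid cells can mark vertical obstacle candidates; the threshold value and the 4-neighbourhood choice are illustrative assumptions:

```python
import numpy as np

def vertical_obstacles(height_map, step_thresh=0.5):
    """Mark cells involved in a vertical height step larger than
    `step_thresh` metres w.r.t. a right/down neighbour; a simplified
    stand-in for the rule-based region growing of Algorithm 1."""
    rows, cols = height_map.shape
    obstacle = np.zeros((rows, cols), dtype=bool)
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (0, 1)):   # right and down neighbours
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and \
                        abs(height_map[r, c] - height_map[rr, cc]) > step_thresh:
                    obstacle[r, c] = obstacle[rr, cc] = True
    return obstacle
```

A loose threshold errs toward marking only unmistakably vertical structure, which is exactly what makes these labels safe to use as weak obstacle supervision.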
In addition, we assume that the vehicle path chosen by the human driver must belong to the drivable zone. Therefore, we project the data collection car's GPS trajectories, widened to the width of the car, onto the input height map, and the covered pixels are used as weak drivable zone labels.
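The trajectory projection can be sketched as rasterizing ego-frame path points widened to the car's footprint; grid size, resolution and car width below are illustrative values, and the square footprint is a simplification:

```python
import numpy as np

def path_labels(trajectory, grid_size=200, resolution=0.2, car_width=1.8):
    """Rasterize a list of ego-frame (x, y) trajectory points (metres)
    onto the height-map grid, widened to half the car width on each
    side, producing weak drivable-zone labels."""
    drivable = np.zeros((grid_size, grid_size), dtype=bool)
    half = grid_size // 2
    radius = int(np.ceil(car_width / 2 / resolution))  # cells per half-width
    for x, y in trajectory:
        col = int(np.floor(x / resolution)) + half
        row = half - int(np.floor(y / resolution))
        # Mark a square patch of half the car width around each point.
        drivable[max(row - radius, 0):row + radius + 1,
                 max(col - radius, 0):col + radius + 1] = True
    return drivable
```

Since the human driver actually drove there, every marked pixel is a high-confidence positive example, with no human annotation needed.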
IV-B Training Setup
The training process and experiments are conducted on an NVIDIA TitanX GPU. The network is trained with the Adam optimizer, a learning rate of 1e-4 and a batch size of 16. Data augmentation consists mainly of image rotation, because the vehicle seldom makes a turn. In the semi-supervised learning loss, the regularization weight $\lambda$ is usually set to less than 1. When converting the cost map into discrete labels, we set the thresholds $\tau_{dri}$ and $\tau_{obs}$ to 0.5 for an SUV; in practice, they can be adjusted according to the traversing capability of the vehicle.
IV-C Evaluation Measures
To evaluate the quantitative performance of different algorithms in the off-road environment, we design several evaluation measures and present them in Table.I. Due to the ambiguous definition of the grey zone, we do not evaluate performance on the grey zone samples directly, but only on the drivable zone and obstacle zone samples.
We define $PRE$ to evaluate precision. For the drivable zone, $PRE_{dri} = |Y_{dri} \cap G_{dri}| / |Y_{dri}|$, where $|Y_{dri}|$ represents the number of pixels predicted as the drivable zone. $PRE_{dri}$ measures the percentage of extracted drivable pixels that belong to the actual drivable zone in the ground truth. For the obstacle zone, $PRE_{obs}$ measures the percentage of extracted obstacle pixels that belong to the obstacle zone in the human annotations.
We define $REC$ to evaluate recall. For the drivable zone, $REC_{dri} = |Y_{dri} \cap G_{dri}| / |G_{dri}|$, where $|G_{dri}|$ represents the number of drivable zone pixels in the ground truth. $REC_{dri}$ measures the percentage of the ground truth drivable zone extracted by the method. For the obstacle zone, $REC_{obs}$ is defined in a similar fashion.
A vehicle path measure is defined only for the drivable zone: it measures the percentage of vehicle path pixels (VP) extracted as the drivable zone. We believe that the vehicle paths chosen by the human driver must be areas with relatively low traversability cost. Therefore, this measure is designed to encourage extracting the vehicle path pixels as the drivable zone.
Finally, the $F$ measure is a widely used indicator that considers both precision and recall; it is the harmonic average of precision ($PRE$) and recall ($REC$), i.e., $F = 2 \cdot PRE \cdot REC / (PRE + REC)$. In this work, some methods tend to extract a narrow drivable zone similar to a vehicle path, which leads to very high precision ($PRE$) but low recall ($REC$). Other methods tend to extract a wider drivable zone, which leads to higher recall ($REC$) but lower precision ($PRE$). In these cases, the $F$ measure is considered the most important indicator for evaluating a method's performance.
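These measures can be sketched as below, following the definitions above; the integer label encoding (0 = dri, 2 = obs) is a hypothetical choice:

```python
import numpy as np

def zone_metrics(pred, gt, label):
    """Precision, recall and F measure for one zone label, following
    Table I's definitions (|X| = number of pixels in X); the max(...)
    guards avoid division by zero on empty zones."""
    tp = np.sum((pred == label) & (gt == label))
    pre = tp / max(np.sum(pred == label), 1)
    rec = tp / max(np.sum(gt == label), 1)
    f = 2 * pre * rec / max(pre + rec, 1e-12)
    return float(pre), float(rec), float(f)
```

Called once with the dri label and once with the obs label, this reproduces the per-zone columns reported in the results tables.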
| Method | Drivable zone PRE | REC | VP | F | Obstacle zone PRE | REC | F |
|---|---|---|---|---|---|---|---|
| 3-class FCN (fully sup.) | 74.93 | 82.99 | 98.92 | 78.75 | 94.36 | 98.44 | 96.36 |
| Ours (fully sup.) | 76.01 | 86.72 | 98.09 | 81.01 | 96.20 | 96.75 | 96.47 |
| RG-FCN (weakly sup.) | 59.78 | 79.15 | 93.16 | 68.11 | 94.46 | 95.38 | 94.92 |
| Oxford PP (weakly sup.) | 97.00 | 47.38 | 83.71 | 63.66 | 98.40 | 89.84 | 93.93 |
| Ours (weakly sup.) | 72.38 | 78.83 | 95.21 | 75.47 | 96.31 | 94.84 | 95.57 |
V Experimental Results
We build a typical off-road dataset using our data collection vehicle, which is equipped with a Velodyne HDL-64 LiDAR, a front-view monocular camera and a GPS/IMU system. To collect the input data, we project the point clouds captured by the LiDAR into a bird's-eye view; the position and posture information captured by the GPS/IMU system is required during the projection process. In addition, we use the GPS/IMU system to record the vehicle trajectory for automatically labelling the vehicle path. We emphasize that the camera data are only used for visualization.
During the data collection process, the vehicle is driven by a human driver and all kinds of data are time-synchronized. The input height map has a pixel size of 0.2 metres, and the height value of each pixel is linearly projected to an integer range. The whole dataset contains 1961 frames of data, and the driving distance is approximately 785 meters. The frames are split into a training set, a validation set and a test set.
V-B Proposed Method Results
To evaluate our proposed method's performance, we first compare it with a fully supervised baseline to demonstrate the soundness of our model design for the grey zone. Another advantage of our model is that it is also suitable for weakly and semi-supervised learning; therefore, we compare our weakly and semi-supervised results with other baselines.
Figure.4 and Figure.5 show the qualitative test results of different methods in two typical scenes: a crossroad scene and a straight road scene. Note that the cost maps of the baseline methods are directly remapped from their output labels, which have only 3 discrete values.
V-B1 Fully Supervised Results
Fully supervised results use the human-annotated ground truth for model training. The baseline method '3-class FCN' is based on a fully convolutional network with the same depth as the branches in our proposed network, and it treats the problem as a common 3-class classification. Our proposed fully supervised method achieves better performance than '3-class FCN' in most evaluation measures.
The first three columns in Figure.4 give a more specific example: our fully supervised method is more robust than the common 3-class classification method in the complex crossroad scene. The baseline '3-class FCN' misclassifies the left side crossroad as an obstacle zone, while our method successfully extracts the whole structure of this crossroad.
V-B2 Weakly Supervised Results
In the weakly supervised setting, all the labels used for model training are auto-generated weak labels.
We introduce two baseline methods for comparison. The first, denoted 'RG-FCN', uses the traditional rule-based region growing method described in Algorithm1 to generate weak labels, which are then used to train an FCN with a cross-entropy loss function. The second baseline, 'Oxford path proposal', was originally designed for the task of path proposal based on image data and achieved great performance on the KITTI dataset. Due to the lack of LiDAR-based weakly supervised methods, we re-implement this method in our LiDAR-based framework as another baseline.
The qualitative visualization results are shown in the last four columns of Figure.4 and Figure.5. From these results, it is easy to see that the 'RG-FCN' method tends to extract a wider drivable zone than the ground truth: the rule-based method cannot distinguish the drivable zone from the grey zone with only a few thresholds. The 'Oxford path proposal' results show the opposite tendency, with a very narrow drivable zone similar to the vehicle path. Its fundamental defect is that too many pixels between this narrow drivable zone and the obstacle zone are labelled as unknown; a large percentage of these pixels are actually drivable, and predicting them accurately matters greatly for autonomous vehicles. Compared with these two baseline methods, our weakly supervised method performs clearly better in extracting the drivable area. In addition, our method is as robust as the fully supervised method in the crossroad scene.
The quantitative analysis shows that our weakly supervised method extracts the drivable area more accurately than the others. Although one baseline method has a higher precision and the other a slightly higher recall, both perform very poorly on the remaining evaluation measures. In other words, our method is the most balanced: it obtains an $F$ measure 7.4% higher than 'RG-FCN' and 11.8% higher than the 'Oxford path proposal' method.
V-B3 Semi-supervised Results
In the semi-supervised setting, a proportion of human-annotated labels and auto-generated labels are used for training at the same time. Regardless of the number of human-annotated labels, we use all weak labels for training, given their very low generation cost. We list the semi-supervised result using half of the human-annotated labels for training as a representative in Table.II. The $F$ measure of our semi-supervised method (81.73%) achieves an impressive improvement compared to the other baseline methods, and it is even higher than the fully supervised result. Our method achieves better evaluations than the other methods in all measures except precision; the explanation is similar to that of the weakly supervised models.
| Method | 6.25% semi-sup. | 12.5% semi-sup. | 25% semi-sup. | 50% semi-sup. | fully sup. |
To explore how the ratio of human-annotated labels influences our model's performance, we compare semi-supervised models trained with different ratios of human-annotated labels on the key indicator, the $F$ measure. The detailed performance on the test set can be seen in Figure.6; we split the test set into 10 batches and evaluate on them separately. The quantitative results are shown in Table.III, where the percentage in front of 'semi-sup.' represents the ratio of human-annotated labels used for training. The $F$ measure of the 50% semi-supervised version is higher than that of the fully supervised method, and the 25% semi-supervised version achieves higher performance than the '3-class FCN' baseline with only a quarter of the human-annotated labels. This shows that our proposed semi-supervised method can significantly reduce the demand for high-cost human annotations.
VI Conclusion
In this paper, we propose a deep learning framework for off-road drivable area extraction. The proposed network structure is specifically designed for the ambiguous grey zone in the off-road environment. We also propose an automatic labelling method that generates large quantities of weak labels from the vehicle's recorded driving data. Our method can significantly reduce the demand for human-annotated data in weakly and semi-supervised network training. Importantly, it is demonstrated that the proposed semi-supervised method can achieve better performance than the fully supervised method with even fewer human-annotated labels. In this work, the camera images are only used for visualization, but they actually contain many useful features for drivable area extraction, such as colors and textures. We plan to fuse the camera information into our framework and enhance the robustness of far-field drivable area extraction in the off-road environment.
-  Aharon Bar Hillel, Ronen Lerner, Dan Levi, and Guy Raz. Recent progress in road and lane detection: A survey. Machine Vision and Applications, 25(3):727–745, 2014.
-  Yinghua He, Hong Wang, and Bo Zhang. Color-Based Road Detection in Urban Traffic Scenes. IEEE Transactions on Intelligent Transportation Systems, 5(4):309–318, 2004.
-  José M. Álvarez and Antonio M. López. Road detection based on illuminant invariance. IEEE Transactions on Intelligent Transportation Systems, 12(1):184–193, 2011.
-  Jilin Mei, Yufeng Yu, Huijing Zhao, and Hongbin Zha. Scene-Adaptive Off-Road Detection Using a Monocular Camera. IEEE Transactions on Intelligent Transportation Systems, 19(1):242–253, 2018.
-  Liang Xiao, Bin Dai, Daxue Liu, Tingbo Hu, and Tao Wu. CRF based road detection with multi-sensor fusion. IEEE Intelligent Vehicles Symposium, Proceedings, 2015-August(Iv):192–198, 2015.
-  Wende Zhang. LIDAR-based road and road-edge detection. IEEE Intelligent Vehicles Symposium, Proceedings, pages 845–848, 2010.
-  Yuan Yuan, Zhiyu Jiang, and Qi Wang. Video-based road detection via online structural learning. Neurocomputing, 168:336–347, 2015.
-  Mohamed Aly. Real time Detection of Lane Markers in Urban Streets. IEEE Intelligent Vehicles Symposium, pages 7–12, 2008.
-  ZuWhan Kim. Robust Lane Detection and Tracking in Challenging Scenarios. IEEE Transactions on Intelligent Transportation Systems, 9(1):16–26, 2008.
-  Jose M. Álvarez, Antonio M. López, Theo Gevers, and Felipe Lumbreras. Combining priors, appearance, and context for road detection. IEEE Transactions on Intelligent Transportation Systems, 15(3):1168–1178, 2014.
-  Hui Kong, Jean-Yves Audibert, and Jean Ponce. General Road Detection From a Single Image. IEEE Transactions on Image Processing, 19(8):2211–2220, 2010.
-  Shengyan Zhou and Karl Iagnemma. Self-supervised learning method for unstructured road detection using fuzzy support vector machines. IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems, IROS 2010 - Conference Proceedings, pages 1183–1189, 2010.
-  Arturo L Rankin, Andres Huertas, and Larry H Matthies. Stereo-vision-based terrain mapping for off-road autonomous navigation. In Unmanned Systems Technology XI, volume 7332, page 733210. International Society for Optics and Photonics, 2009.
-  Alberto Broggi, Elena Cardarelli, Stefano Cattani, and Mario Sabbatelli. Terrain mapping for off-road autonomous ground vehicles using rational b-spline surfaces and stereo vision. In 2013 IEEE Intelligent Vehicles Symposium (IV), pages 648–653. IEEE, 2013.
-  W. S. Wijesoma, K. R.S. Kodagoda, and Arjuna P. Balasuriya. Road-boundary detection and tracking using ladar sensing. IEEE Transactions on Robotics and Automation, 20(3):456–464, 2004.
-  Yihuan Zhang, Jun Wang, Xiaonian Wang, and John M. Dolan. Road-Segmentation-Based Curb Detection Method for Self-Driving via a 3D-LiDAR Sensor. IEEE Transactions on Intelligent Transportation Systems, pages 1–11, 2018.
-  Alireza Asvadi, Cristiano Premebida, Paulo Peixoto, and Urbano Nunes. 3D Lidar-based static and moving obstacle detection in driving environments: An approach based on voxels and multi-region ground planes. Robotics and Autonomous Systems, 83:299–311, 2016.
-  Xiao Hu, F. Sergio A. Rodriguez, and Alexander Gepperth. A multi-modal system for road detection and segmentation. IEEE Intelligent Vehicles Symposium, Proceedings, (Iv):1365–1370, 2014.
-  H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. Bradski. Self-supervised Monocular Road Detection in Desert Terrain. Robotics: Science and Systems II, 2006.
-  Benjamin Suger, Bastian Steder, and Wolfram Burgard. Traversability analysis for mobile robots in outdoor environments: A semi-supervised learning approach based on 3D-lidar data. Proceedings - IEEE International Conference on Robotics and Automation, 2015-June(June):3941–3946, 2015.
-  Ara V. Nefian and Gary R. Bradski. Detection of drivable corridors for off-road autonomous navigation. Proceedings - International Conference on Image Processing, ICIP, (Figure 1):3025–3028, 2006.
-  Dan Barnes, Will Maddern, and Ingmar Posner. Find your own way: Weakly-supervised segmentation of path proposals for urban autonomy. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 203–210. IEEE, 2017.
-  Xiaofeng Han, Jianfeng Lu, Chunxia Zhao, and Hongdong Li. Fully Convolutional Neural Networks for Road Detection with Multiple Cues Integration. 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 4608–4613, 2018.
-  Gabriel L. Oliveira, Wolfram Burgard, and Thomas Brox. Efficient deep models for monocular road segmentation. IEEE International Conference on Intelligent Robots and Systems, 2016-November:4885–4891, 2016.
-  Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.
-  Mauro Bellone, Giulio Reina, Luca Caltagirone, and Mattias Wahde. Learning Traversability from Point Clouds in Challenging Scenarios. IEEE Transactions on Intelligent Transportation Systems, 19(1):296–305, 2018.
-  Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
-  Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.