Accurate and efficient scene perception of urban environments is crucial for various applications, including HD mapping, autonomous driving, 3D model reconstruction, and smart cities.
In the past decade, the largest portion of research in urban mapping has used 2D satellite and airborne imagery [5, 14], and autonomous driving research has also relied heavily on 2D images captured by digital cameras.
Compared with 2D images, which lack georeferenced 3D information, 3D point clouds collected by LiDAR sensors have become desirable for urban studies [15, 22].
However, point clouds are unstructured, unordered, and usually large in volume. Deep learning algorithms have shown advantages in tackling these challenges across various point cloud processing tasks, including semantic segmentation [2, 21], object detection [4, 34], classification [13, 20], and localization [6, 30].
Mobile platforms that integrate MLS sensors, location sensors (e.g., Global Navigation Satellite Systems (GNSS)), and cameras (e.g., panoramic and digital cameras) are gaining popularity in urban mapping and autonomous driving due to the flexibility of data collection [10, 33], but training effective deep learning models is not feasible without high-quality labels for the point clouds. The development of deep learning has always been driven by high-quality datasets and benchmarks. They allow researchers to focus on improving algorithm performance without the hassle of collecting, cleaning, and labeling large amounts of data, and they ensure that the performance of different algorithms is directly comparable.
In this paper, Toronto-3D, a new large-scale urban outdoor point cloud dataset acquired by an MLS system, is presented.
This dataset covers approximately 1 km of road and consists of about 78.3 million points. An example of the proposed dataset is shown in Fig. 1. The main contributions of this paper are:
present a large-scale point-wise labeled urban outdoor point cloud dataset for semantic segmentation,
investigate an integrated network for point cloud semantic segmentation,
provide an extensive comparison on the performance of state-of-the-art deep learning semantic segmentation methods on the proposed dataset.
2 Available point cloud datasets for 3D Semantic Segmentation
With the advancement of LiDAR and RGB-D sensors and the development of autonomous driving and 3D vision, point cloud data has become more and more accessible. However, such datasets usually have a very large volume of data and contain considerable noise, making it difficult and time-consuming to produce high-quality manual labels. Popular accessible outdoor point cloud datasets for semantic segmentation are as follows:
| Dataset | Year | Primary fields | Length | # points | # classes (labeled) | # classes (evaluated) | LiDAR sensor |
|---|---|---|---|---|---|---|---|
| Oakland | 2009 | x, y, z, label | 1510 m | 1.6 M | 44 | 5 | SICK LMS |
| iQmulus | 2015 | – | 200 m | 12 M | 22 | 8 | Riegl LMS-Q120i |
| Paris-Lille-3D | 2018 | x, y, z, intensity, label | 1940 m | 143.1 M | 50 | 9 | Velodyne HDL-32E |
| SemanticKITTI | 2019 | x, y, z, intensity, label | 39.2 km | 4.5 B | 28 | 25 | Velodyne HDL-64E |
| Toronto-3D | 2020 | x, y, z, RGB, intensity, GPS time, scan angle, label | 1000 m | 78.3 M | 8 | 8 | 32-line LiDAR (Teledyne Optech Maverick) |
Oakland 3-D is one of the earliest outdoor point cloud datasets, acquired by an MLS system with a side-looking SICK LMS sensor. The sensor is a mono-fiber LiDAR, so the point density is relatively low. The dataset contains about 1.6 million points labeled with 44 classes, but only 5 classes (vegetation, wire, pole, ground, and facade) are available for training and evaluation. The dataset is relatively small, making it more suitable for developing and testing lightweight networks.
The iQmulus dataset comes from the IQmulus & TerraMobilita Contest and was acquired in Paris by a system called Stereopolis II. A mono-fiber Riegl LMS-Q120i LiDAR was used to collect the point clouds. The full dataset has over 300 million points labeled with 22 classes, but only a small subset of 12 million points, covering a 200 m range with 8 valid classes, was publicly released for the contest. The dataset suffers from unsatisfactory classification quality due to occlusions from the mono-fiber LiDAR sensor and to the annotation process.
Semantic3D was collected by terrestrial laser scanners and has much higher point density and accuracy than the other datasets. However, only very limited viewpoints are feasible for static laser scanners, and similar datasets are not easily acquired in practice.
Paris-Lille-3D is one of the most popular outdoor point cloud datasets of recent years. It was collected with an MLS system using a Velodyne HDL-32E LiDAR, so its point density and measurement accuracy are close to those of point clouds acquired by autonomous driving vehicles. The dataset covers close to 2 km with over 140 million points, and very detailed labels of 50 classes are provided. For benchmarking, 9 classes are used for semantic segmentation.
SemanticKITTI is one of the most recent and largest publicly available datasets for semantic segmentation. It was built by further annotating the widely used KITTI dataset. The dataset contains about 4.5 billion points covering close to 40 km, labeled scan by scan with 25 classes for the evaluation of semantic segmentation, and it is focused on algorithms for autonomous driving.
Development and validation of deep learning algorithms demand more datasets in different parts of the world with various object labels. Toronto-3D is introduced in this paper to provide an additional high-quality point cloud dataset for 3D semantic segmentation with new labels. Table 1 shows a comparison of comprehensive indicators of different datasets.
3 New dataset: Toronto-3D
3.1 Data acquisition
The point clouds in this dataset were acquired with a vehicle-mounted MLS system, the Teledyne Optech Maverick (http://www.teledyneoptech.com/en/products/mobile-survey/maverick/). The system consists of a 32-line LiDAR sensor, a Ladybug 5 panoramic camera, a GNSS system, and a simultaneous localization and mapping (SLAM) system. The collected point clouds were further processed with the LMS Pro software (https://www.teledyneoptech.com/en/products/software/lms-pro/). Natural color (RGB) was assigned to each point according to the imaging camera.
3.2 Description of the dataset
This dataset was collected on Avenue Road in Toronto, Canada, covering approximately 1 km of road segment with approximately 78.3 million points. This dataset is divided into four sections, and each section covers a range of about 250 m. An overview of the approximate boundary of each section is illustrated in Fig. 2.
This dataset was collected using a 32-line LiDAR sensor, and the point clouds have a high density of about 1000 points/m² on the ground on average. The dataset covers the full range of the MLS sensor, approximately 100 m from the road centerline, without trimming. Only limited post-processing was done, so as to resemble real-world point cloud collection scenarios.
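For context, the kind of radial trimming that Toronto-3D deliberately skips could be sketched as follows; the array layout, polyline centerline representation, and brute-force distance computation are illustrative assumptions, not part of the released dataset:

```python
import numpy as np

def trim_to_range(points, centerline_xy, max_dist=100.0):
    """Keep only points within max_dist (m) of the nearest centerline vertex.

    points: (N, 3) array of x, y, z; centerline_xy: (M, 2) polyline vertices.
    Toronto-3D skips this step and keeps the full sensor range instead.
    """
    # Horizontal distance from every point to every centerline vertex
    # (fine for a short polyline; a KD-tree would scale better).
    d = np.linalg.norm(points[:, None, :2] - centerline_xy[None, :, :], axis=2)
    return points[d.min(axis=1) <= max_dist]

pts = np.array([[0.0, 5.0, 1.0],     # 5 m from the centerline: kept
                [0.0, 150.0, 1.0]])  # 150 m away: dropped
line = np.array([[0.0, 0.0]])
kept = trim_to_range(pts, line)
```

Paris-Lille-3D effectively applies such a filter at roughly 20 m, whereas Toronto-3D retains everything out to the sensor's ~100 m limit.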
Each of the four sections of the dataset was saved separately as a .ply file. The point clouds were classified, and point-wise labels were assigned manually, using CloudCompare (https://www.cloudcompare.org). Each point cloud file has the following attributes:
x, y, z: Position of each point recorded in meters, in NAD83 / UTM Zone 17N
R, G, B: Natural color of each point in red, green, and blue, recorded as integers in [0, 255]
intensity: LiDAR reflectance intensity of each point, recorded as an integer in [0, 255]
GPS time: GPS time at which each point was collected, recorded as a float
scan angle index: Scan angle of each point in degrees, recorded as an integer in [-13, 31]
label: Object class label of each point, recorded as an integer in [0, 8]
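As a rough illustration of this per-point schema, a NumPy structured array can mirror the attributes listed above; the field names and dtypes below are assumptions chosen for illustration and may differ from those used in the released .ply files:

```python
import numpy as np

# Hypothetical schema mirroring the per-point attributes described above.
point_dtype = np.dtype([
    ("x", "f8"), ("y", "f8"), ("z", "f8"),           # NAD83 / UTM 17N metres
    ("red", "u1"), ("green", "u1"), ("blue", "u1"),  # natural colour, [0, 255]
    ("intensity", "u1"),                             # LiDAR intensity, [0, 255]
    ("gps_time", "f8"),                              # acquisition time
    ("scan_angle", "i2"),                            # degrees, roughly [-13, 31]
    ("label", "u1"),                                 # class id, [0, 8]
])

points = np.zeros(3, dtype=point_dtype)
points["label"] = [1, 6, 1]          # two road points, one pole
road = points[points["label"] == 1]  # a boolean mask selects one class
```

Per-class masks like this are the basic operation behind both the label statistics in Table 2 and the per-class IoU evaluation later in the paper.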
A sample of the labeled point cloud is shown in Fig. 1. Similar to previous datasets, the object class labels are defined as follows:
Road (label 1): Paved road surfaces, including sidewalks, curbs, parking lots
Road marking (label 2): Pavement markings on road surfaces, including driving lines, arrows, pedestrian crossings
Natural (label 3): Trees, shrubs, not including grass and bare soil
Building (label 4): Any parts of low and multi-story buildings, store fronts
Utility line (label 5): Power lines, telecommunication lines over the streets
Pole (label 6): Utility poles, traffic signs, lamp posts
Car (label 7): Moving cars and parked cars on road sides and parking lots
Fence (label 8): Vertical barriers, including wooden fences, walls of construction sites
Unclassified (label 0)
A summary of labels in each section is shown in Table 2.
| Section | Road | Road marking | Natural | Building | Utility line | Pole | Car | Fence | Unclassified | Total |
|---|---|---|---|---|---|---|---|---|---|---|
3.3 Challenges of Toronto-3D
The Toronto-3D dataset is comparable to Paris-Lille-3D in several respects. Both are urban outdoor large-scale scenes collected by a vehicle-mounted MLS system with a 32-line LiDAR sensor. Toronto-3D covers approximately half the distance of Paris-Lille-3D and includes half the number of points, and both are labeled with a similar number of classes for semantic segmentation. Unlike Paris-Lille-3D, however, the Toronto-3D dataset has the following characteristics, which pose additional challenges to point cloud semantic segmentation algorithms.
Full coverage of the LiDAR measurement range. The Teledyne Optech Maverick MLS system used to acquire this dataset has a valid measurement distance of approximately 100 m. Unlike Paris-Lille-3D, where only points within approximately 20 m of the road centerline are available, Toronto-3D keeps all collected points within about 100 m. The full measurement range of Toronto-3D resembles point cloud data collection in real-world scenarios, and it brings the challenges of point density variation with distance, more noise, and more objects far from the sensor.
Variation of point density. Unlike the even distribution of point density in Paris-Lille-3D, the Toronto-3D dataset has larger variations of point density, caused mainly by two factors: inclusion of the full LiDAR measurement range, and repeated scans during point cloud collection. The variation of point density is illustrated in Fig. 3. As shown in the scene, the cars (colored in orange) on the streets have much higher point density than the parked cars at the upper middle of the image. The lower-density cars are approximately 30 to 40 m from the road centerline, meaning such scenarios would not be included in Paris-Lille-3D. In addition, in the center of the scene, point density is significantly higher (over 10 times) than in other parts, caused by repeated scans while the vehicle was stopped at the intersection during data collection. The repeated scans also produced different point densities on the same building at the same distance from the sensor at different locations. This large variation in point density tests the robustness of algorithms in capturing features effectively.
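One way to make such density variation concrete is to estimate a local density around every point. The brute-force sketch below (with an assumed search radius) is illustrative only and would be far too slow for 78 million points, where a spatial index would be needed:

```python
import numpy as np

def local_density(points, radius=0.5):
    """Approximate points-per-cubic-metre around each point (illustrative)."""
    # Full pairwise distance matrix: O(N^2), only viable for small samples.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    counts = (dists <= radius).sum(axis=1)    # neighbours, including the point itself
    volume = 4.0 / 3.0 * np.pi * radius ** 3  # volume of the search sphere
    return counts / volume

# A dense cluster next to a sparse one: the density estimates differ sharply,
# mimicking near-sensor vs far-from-sensor regions of an MLS scan.
rng = np.random.default_rng(0)
dense = rng.uniform(0.0, 0.5, size=(200, 3))
sparse = rng.uniform(5.0, 10.0, size=(20, 3))
d = local_density(np.vstack([dense, sparse]))
```

Density estimates like these are also what radius-based operators (e.g. the ball queries in PointNet++-style grouping) implicitly depend on.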
New challenging classes. Two class labels are not commonly seen in the other popular datasets listed in Table 1: road marking and utility line. Road markings include various pavement markings on the road surface, such as pedestrian crossings and lane arrows, with various sizes and shapes, and they are difficult to distinguish from road surfaces. Utility lines are thin linear objects that are challenging to identify, especially where they overlap with poles and trees or run close to buildings. In addition, the fence class, which covers various wall-like vertical structures, is also challenging to identify.
4 Baselines of semantic segmentation
4.1 State-of-the-art methods
Semantic segmentation of point clouds assigns a semantic label to every point. With the recent development of 3D deep learning, this task can be performed by end-to-end deep neural networks. Existing 3D deep learning models for point clouds can be roughly divided into three categories: view-based models, voxel-based models, and point-based models.
View-based models such as MVCNN project 3D point clouds into multiple views as 2D images, but they do not fully exploit the rich 3D information. Voxel-based models such as VoxNet and 3D-CNN structure unordered point clouds into voxel grids, so that known architectures and methods for 2D images can be extended to 3D space. However, point clouds are sparse and have varying densities, which makes voxelization inefficient.
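A minimal sketch of the voxelization step such models rely on, using an assumed voxel size: each occupied voxel is reduced to its centroid, which also illustrates why a sparse cloud occupies only a tiny fraction of a dense 3D grid.

```python
import numpy as np

def voxel_downsample(points, voxel=0.1):
    """Keep one representative point (the centroid) per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)   # integer voxel coordinates
    _, inverse, counts = np.unique(keys, axis=0,
                                   return_inverse=True, return_counts=True)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)                   # accumulate points per voxel
    return sums / counts[:, None]                      # centroid of each voxel

pts = np.array([[0.01, 0.02, 0.0],   # these two fall in the same 0.1 m voxel
                [0.03, 0.04, 0.0],
                [5.00, 5.00, 5.0]])  # a lone distant point: its own voxel
down = voxel_downsample(pts)
```

Here 3 points occupy only 2 voxels of a grid that would span 50³ cells, which is the inefficiency the text refers to.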
Point-based methods directly process unordered and unstructured point clouds to capture 3D spatial features. Starting from PointNet, which learns point-wise spatial features with multi-layer perceptron (MLP) layers, point-based methods have developed rapidly, followed by PointNet++ and PointCNN. Graph models have also been applied to extract spatial features in point-based models; such methods include ECC and DGCNN.
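The core idea behind PointNet-style point-based methods, a shared per-point MLP followed by a symmetric pooling function, can be sketched in a few lines; the weights and layer sizes here are arbitrary toy values, not the actual architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((3, 16))  # one shared weight matrix, applied per point

def pointnet_global_feature(points):
    """PointNet in miniature: shared per-point MLP, then max pooling."""
    h = np.maximum(points @ W1, 0.0)  # same MLP + ReLU on every point alike
    return h.max(axis=0)              # symmetric function: order-invariant

pts = rng.standard_normal((8, 3))
f1 = pointnet_global_feature(pts)
f2 = pointnet_global_feature(pts[::-1])  # same cloud, points in reverse order
```

Because max pooling is symmetric, `f1` and `f2` are identical, which is how such networks handle the unordered nature of point clouds.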
Six state-of-the-art point-based deep learning models were tested on the proposed dataset as baseline approaches:
DGCNN: DGCNN constructs graphs to extract local geometric features from local neighborhoods and applies EdgeConv as a convolution-like operation. EdgeConv is isotropic with respect to input features, with convolutional operations on graph nodes and their edges.
KPFCNN: KPFCNN introduces a convolution function called KPConv that captures local features within a range, with weights defined by a set of kernel points. KPConv is robust to varying point cloud densities and is computationally efficient. KPFCNN currently ranks first on the Paris-Lille-3D benchmark.
TGNet : TGNet introduces a novel graph convolution function called TGConv defined as products of point features from local neighborhoods. The features are extracted with Gaussian weighted Taylor kernel functions. An end-to-end semantic segmentation network is constructed with hierarchical TGConv followed by a conditional random field (CRF) layer.
Based on the above methods, a new integrated network, named MS-TGNet, is proposed in this paper.
4.2 MS-TGNet
Considering that the full range of approximately 100 m from the road centerline was preserved in this dataset, the point cloud density differs greatly at different distances from the sensor. Multi-scale grouping (MSG), proposed in PointNet++, was designed to capture features more effectively in point clouds with large variations in point density. A revised structure of TGNet, called MS-TGNet, is proposed to evaluate the effectiveness of MSG layers on this new dataset. The revised U-shaped structure with 3 layers is illustrated in Fig. 4. An MSG layer was used in the second layer of the original TGNet architecture to capture local geometric features at three different radii. In addition, the CRF layer was removed from the network because of its negative impact on the results in the ablation analysis.
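The multi-scale grouping idea borrowed from PointNet++ can be sketched as follows for a single centroid; the radii echo those in Table 4, but the padding strategy and group size are simplifying assumptions rather than the exact MS-TGNet implementation:

```python
import numpy as np

def multi_scale_group(points, centroid, radii=(0.2, 0.4, 0.8), max_pts=16):
    """Gather fixed-size neighbourhoods around one centroid at several radii.

    Features extracted at each scale are later concatenated, which is what
    makes an MSG layer robust to uneven point density.
    """
    d = np.linalg.norm(points - centroid, axis=1)
    groups = []
    for r in radii:
        idx = np.where(d <= r)[0][:max_pts]
        if idx.size == 0:                       # empty ball: fall back to centroid
            group = np.repeat(centroid[None, :], max_pts, axis=0)
        else:                                   # pad by cycling indices
            group = points[np.resize(idx, max_pts)]
        groups.append(group - centroid)         # translate into the local frame
    return groups                               # one (max_pts, 3) array per radius

pts = np.random.default_rng(2).uniform(-1.0, 1.0, size=(100, 3))
scales = multi_scale_group(pts, centroid=np.zeros(3))
```

Each scale then feeds its own shared MLP, and the pooled features from all radii are concatenated into the centroid's descriptor.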
4.3 Evaluation metrics
For the evaluation of semantic segmentation results, the intersection over union of each class (IoU), the overall accuracy (OA), and the mean IoU (mIoU) are used:

$$IoU_i = \frac{TP_i}{TP_i + FP_i + FN_i}, \qquad OA = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FN_i)}, \qquad mIoU = \frac{1}{N} \sum_{i=1}^{N} IoU_i$$

where $N$ is the total number of labels, $i$ denotes the $i$-th label, and $TP_i$, $FP_i$, and $FN_i$ are the numbers of true positive, false positive, and false negative points of the predictions, respectively. OA and mIoU evaluate the overall quality of the semantic segmentation, while the IoU of each class measures the performance on that class.
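The metrics above can be computed from a point-wise confusion matrix; a minimal sketch:

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Per-class IoU, overall accuracy, and mIoU from point-wise predictions."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt, pred), 1)            # rows: ground truth, cols: predicted
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)  # guard against empty classes
    oa = tp.sum() / cm.sum()
    return iou, oa, iou.mean()

gt = np.array([0, 0, 1, 1])                 # toy ground-truth labels
pred = np.array([0, 1, 1, 1])               # one class-0 point mislabeled
iou, oa, miou = segmentation_metrics(pred, gt, 2)
```

On this toy example OA is 0.75 and the IoU of class 0 is 0.5, since the mislabeled point counts as both a false negative for class 0 and a false positive for class 1.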
4.4 Results and discussions
| Methods | OA | mIoU | Road | Rd mrk. | Natural | Building | Util. line | Pole | Car | Fence |
|---|---|---|---|---|---|---|---|---|---|---|
| PointNet++ (MSG) | 90.58 | 53.12 | 90.67 | 0.00 | 86.68 | 75.78 | 56.20 | 60.89 | 44.51 | 10.19 |
Among the four sections of the dataset, L002 was selected as the testing set, due to its smaller number of points and relatively balanced label distribution, while the other three sections were used for training and validation.
For fair comparison, only the coordinates (x, y, z) of the point clouds were used in this paper; the results are shown in Table 3.
Note that the corresponding model source codes from GitHub of the state-of-the-art methods were directly adopted without parameter tuning.
The results were achieved using a NVIDIA RTX 2080Ti with 11G of RAM, and batch sizes were adjusted accordingly.
The results are for baseline illustration purpose only, and better results could be achieved with other parameter settings.
PointNet++ achieved a mIoU of 56.55% and the highest IoUs in road and car segmentation. However, the PointNet++ model with MSG modules did not perform as well as the base PointNet++ architecture, suggesting that MSG has a limited effect on large-scale datasets. DGCNN performed the worst in terms of both OA and mIoU on our dataset. Since DGCNN uses kNN to construct graphs for capturing local features, it may not perform well on this dataset with its varying point density. The parameter settings of PointNet++ and DGCNN were taken directly from the networks' indoor-scene configurations, which limits the performance of these two algorithms to some extent.
KPFCNN sits at the top of the Paris-Lille-3D benchmark, and it achieved the highest OA and mIoU among the tested state-of-the-art algorithms. The deformable KPConv operator showed its advantage in adapting to different shapes, achieving the highest IoU in building, utility line, and fence segmentation. MS-PCNN captures both point and edge features at multiple scales and performed second only to KPFCNN. TGNet did not show an advantage over the other networks, but an mIoU of 58.34% can be reached by removing the CRF layer (discussed in the ablation studies below).
The proposed MS-TGNet achieved results comparable to KPFCNN, attaining the highest mIoU of 60.96% and the second-highest OA of 91.69% among the baseline results. A visual comparison of the semantic segmentation results of KPFCNN and MS-TGNet with the ground truth is shown in Fig. 5. Both perform well in terms of OA and mIoU, but visual inspection shows that KPFCNN failed to distinguish road markings from road surfaces. Moreover, none of the tested methods reached an IoU over 25% on road marking or fence segmentation.
4.5 Ablation analysis of MS-TGNet
Ablation studies were conducted on the effectiveness of adopting MSG in TGNet at different layers, and on whether CRF layers improve performance on this dataset. Three test models using MSG feature abstraction layers with different radii r were tested without CRF layers. A CRF layer was then added to the best performing model. The results of the ablation analysis are shown in Table 4.
| Model | mIoU |
|---|---|
| TGNet, r = 0.1, 0.2, 0.4, 0.8, w/o CRF | 58.34 |
| TGNet, r = 0.1, 0.2, 0.4, 0.8, w/ CRF | 55.23 |
| MS-TGNet, r = [0.1, 0.2, 0.4], 0.4, 0.8, w/o CRF | 58.19 |
| MS-TGNet, r = 0.1, [0.2, 0.4, 0.8], 0.8, w/o CRF | 60.96 |
| MS-TGNet, r = 0.1, [0.2, 0.4, 0.8], 0.8, w/ CRF | 56.25 |
| MS-TGNet, r = [0.1, 0.2, 0.4], [0.2, 0.4, 0.8], w/o CRF | 54.91 |
Using MSG at the first layer did not improve the performance of the original TGNet network without CRF layers. The model with MSG at the second layer, with r = [0.2, 0.4, 0.8], achieved the highest mIoU of 60.96%.
A two-layer MSG model did not perform as well as the other networks, possibly because smaller batch sizes were required by the RAM limit of our graphics card.
A further analysis of the impact of CRF layers showed that, unlike the results reported on the Paris-Lille-3D dataset, CRF layers worsened performance for both the base TGNet network and the proposed MS-TGNet. CRF can reduce noise in the segmentation, but it may introduce errors in regions with lower point density.
5 Conclusion
This paper presents Toronto-3D, a new large-scale urban outdoor point cloud dataset collected by an MLS system. The dataset covers approximately 1 km of road in Toronto, Canada, with over 78 million points.
All points were preserved in the range of data collection to resemble real-world application scenarios.
This dataset was manually labeled into 8 categories, including road, road marking, natural, building, utility line, pole, car and fence.
Five state-of-the-art end-to-end point cloud semantic segmentation algorithms and a proposed network named MS-TGNet were tested as baselines for this dataset.
The proposed MS-TGNet produces performance comparable to state-of-the-art methods, achieving the highest mIoU of 60.96% and a competitive OA of 91.69% on the new dataset. The Toronto-3D dataset provides new class labels, including road markings, utility lines, and fences, and all tested semantic segmentation methods need to improve on road markings and fences.
The intention of presenting this new point cloud dataset is to encourage the development of creative deep learning models. The labels of this new dataset will be improved and updated with feedback from the research community.
Teledyne Optech is acknowledged for providing the mobile LiDAR point cloud data. Thanks also to Jimei University for point cloud labeling.
References
- SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. Proceedings of the IEEE International Conference on Computer Vision, pp. 9297–9307.
- (2017) Unstructured point cloud semantic labeling using deep segmentation networks. 3DOR 2, pp. 7.
- (2010) Autonomous driving in urban environments: approaches, lessons and challenges. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 368 (1928), pp. 4649–4672.
- (2015) 3D object proposals for accurate object class detection. Advances in Neural Information Processing Systems, pp. 424–432.
- Review of automatic feature extraction from high-resolution optical sensor data for UAV-based cadastral mapping. Remote Sensing 8 (8), pp. 689.
- 3D point cloud registration for localization using a deep neural network auto-encoder. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4631–4640.
- (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361.
- (2017) SEMANTIC3D.NET: a new large-scale point cloud classification benchmark. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. IV-1-W1, pp. 91–98.
- Point cloud labeling using 3D convolutional neural network. 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2670–2675.
- (2011) Towards fully autonomous driving: systems and algorithms. 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 163–168.
- (2018) PointCNN: convolution on X-transformed points. Advances in Neural Information Processing Systems, pp. 820–830.
- (2019) TGNet: geometric graph CNN on 3-D point cloud segmentation. IEEE Transactions on Geoscience and Remote Sensing.
- Deep learning with geodesic moments for 3D shape classification. Pattern Recognition Letters 105, pp. 182–190.
- (2017) A review of supervised object-based land-cover image classification. ISPRS Journal of Photogrammetry and Remote Sensing 130, pp. 277–293.
- (2019) Multi-scale point-wise convolutional neural networks for 3D object segmentation from LiDAR point clouds in large-scale environments. IEEE Transactions on Intelligent Transportation Systems.
- (2015) VoxNet: a 3D convolutional neural network for real-time object recognition. 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928.
- (2009) Contextual classification with functional max-margin Markov networks. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 975–982.
- (2012) Stereopolis II: a multi-purpose and multi-sensor 3D mobile mapping system for street visualisation and 3D metrology. Revue française de photogrammétrie et de télédétection 200 (1), pp. 69–79.
- (2017) PointNet: deep learning on point sets for 3D classification and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
- (2016) Volumetric and multi-view CNNs for object classification on 3D data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5648–5656.
- (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, pp. 5099–5108.
- (2011) Heritage recording and 3D modeling with photogrammetry and 3D scanning. Remote Sensing 3 (6), pp. 1104–1138.
- (2015) Vision-based offline-online perception paradigm for autonomous driving. 2015 IEEE Winter Conference on Applications of Computer Vision, pp. 231–238.
- (2018) Paris-Lille-3D: a large and high-quality ground-truth urban point cloud dataset for automatic segmentation and classification. The International Journal of Robotics Research 37 (6), pp. 545–557.
- (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3693–3702.
- (2015) Multi-view convolutional neural networks for 3D shape recognition. Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953.
- (2019) KPConv: flexible and deformable convolution for point clouds. Proceedings of the IEEE International Conference on Computer Vision.
- (2011) Unbiased look at dataset bias. CVPR 2011, pp. 1521–1528.
- (2015) TerraMobilita/iQmulus urban point cloud analysis benchmark. Computers & Graphics 49, pp. 126–133.
- (2018) DeLS-3D: deep localization and segmentation with a 3D semantic map. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5860–5869.
- (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 1–12.
- (2019) PointConv: deep convolutional networks on 3D point clouds. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9621–9630.
- (2015) Hierarchical extraction of urban objects from mobile laser scanning data. ISPRS Journal of Photogrammetry and Remote Sensing 99, pp. 45–57.
- (2018) VoxelNet: end-to-end learning for point cloud based 3D object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499.