Are We Hungry for 3D LiDAR Data for Semantic Segmentation?

06/08/2020 ∙ by Biao Gao, et al. ∙ Peking University

3D LiDAR semantic segmentation is a pivotal task that is widely involved in many applications, such as autonomous driving and robotics. Studies of 3D LiDAR semantic segmentation have recently achieved considerable development, especially in terms of deep learning strategies. However, these studies usually rely heavily on large amounts of finely annotated data, while point-wise 3D LiDAR datasets are extremely scarce and expensive to label. The performance limitation caused by the lack of training data is called the data hungry effect. This survey aims to explore whether and how we are hungry for 3D LiDAR data for semantic segmentation. Thus, we first provide an organized review of existing 3D datasets and 3D semantic segmentation methods. Then, we provide an in-depth analysis of three representative datasets and several experiments to evaluate the data hungry effects in different aspects. Efforts to solve data hungry problems are summarized for both 3D LiDAR-focused methods and general-purpose methods. Finally, insightful topics are discussed for future research on data hungry problems and open questions.


I Introduction

Today, LiDAR has become the main sensor in many robotic [1][2], mobile mapping [3][4] and autonomous driving [5][6] systems. 3D LiDAR data, captured from either a static viewpoint [7] or a mobile platform [8] during a dynamic procedure, provide a copy of the real world with rich 3D geometry at true scale, and can be represented as either 3D point clouds [9][10] or 2D grids [11], e.g., range images, using a single static frame or a sequence of data frames. Semantic segmentation [12][13] is a fundamental task of scene understanding that divides a whole piece of input data into semantically interpretable categories according to a meaningful real-world taxonomy. With the widespread use of LiDAR sensors in various applications, semantic segmentation of 3D LiDAR data [14][15] is attracting increasing attention. Hereinafter, we use 3D semantic segmentation to emphasize works that address the characteristics of 3D LiDAR data, and semantic segmentation for those of potentially general purpose.

Semantic segmentation has been studied for decades. A comprehensive review of early works up to 2014 is given in [16]. We refer to these works as traditional methods, which are characterized by handcrafted features and bottom-up procedures. Inspired by the remarkable success of deep learning techniques [17][18], recent semantic segmentation works have focused on using deep neural networks to learn richer feature representations and to model the mapping from input data to semantic labels in an end-to-end procedure [19]; these are referred to as deep learning methods hereinafter. However, compared with traditional methods, deep learning methods face the considerable challenge of requiring large quantities of manually labeled training data [20]. The quantity, quality and diversity of training data have a considerable influence on the generalization performance of deep learning models [21][22].

The performance limitation caused by insufficient training data is called the data hungry effect. As noted by G. Marcus in [23], against the background of considerable progress and enthusiasm, the data hungry problem was his first concern among the ten challenges faced by current deep learning systems. For 3D semantic segmentation tasks, 3D LiDAR data with point-wise annotation are required, where S3DIS [24], Semantic3D [7], and SemanticKITTI [8] are among the most popular datasets. These datasets are annotated fully or partially by human operators, which is time-consuming, labor-intensive, and requires special skills and software, e.g., the operators are trained to use professional software to visualize and annotate 3D point clouds, which are much harder to interpret than 2D images. Due to these difficulties, the publicly available datasets for 3D semantic segmentation are very limited in both data size and scene variety compared with those of 2D images [25][26].

In this research, we seek to answer the following questions: Are we hungry for 3D LiDAR data for semantic segmentation using deep learning techniques? If yes, how hungry are we, and how can we solve the problem? To answer these questions, the following steps are taken in this work: 1) reviewing the most popular 3D LiDAR datasets by statistical analysis, 2) reviewing the state-of-the-art 3D semantic segmentation methods with a focus on deep learning methods, 3) conducting experiments and cross validation on the most popular 3D LiDAR datasets using representative methods, 4) reviewing the efforts in the literature to solve the data hungry problem, and 5) discussing future directions and concluding an answer to the questions.

A number of surveys are relevant to this work. [27][28] review early methods of 3D point cloud segmentation and classification in the literature. [12][13][29] review methods and datasets used for semantic segmentation. Furthermore, [30][31][14][32][15][33] review deep learning methods for the 3D semantic segmentation task. In addition, [34] reviews multi-modal methods used for semantic segmentation and detection. However, these surveys focus on summarizing and classifying existing methods, and none of them emphasize 3D datasets or the data hungry problem. To the best of our knowledge, this is the first work to analyze the data hungry problem for 3D semantic segmentation using deep learning techniques, which we address through a literature review, statistical analysis, and cross-dataset and cross-algorithm experiments.

The main contributions of our work are as follows:

  • A broad and organized review of the existing 3D datasets and 3D semantic segmentation methods is provided, where deep learning methods are the focus as they face more challenges from the data hungry problem.

  • An in-depth analysis is conducted on three representative datasets to cross-examine their size and scene diversity. Several experiments are designed by using three state-of-the-art semantic segmentation methods for cross-scene and cross-dataset evaluation of data hungry effects. New findings are presented that provide guidance for future research.

  • A systematic summary of the efforts to solve data hungry problems is given, covering both semantic segmentation methods that rely less on finely annotated data and data annotation methods that are less labor-intensive. Both methods that incorporate domain knowledge of 3D LiDAR data processing and general-purpose methods are reviewed.

  • An insightful discussion of potential future topics to solve the data hungry problem from both methodological and dataset viewpoints is given, followed by open questions that lead to important topics but have not been sufficiently studied until now.

The structure of this paper is as follows. Section II reviews the existing 3D LiDAR datasets and analyzes the size and diversity of three representative 3D datasets. Section III systematically summarizes the existing methods of 3D semantic segmentation. Section IV evaluates the data hungry effect on the 3D semantic segmentation task and answers the question of whether we are hungry for 3D data for semantic segmentation. Section V reviews the existing strategies to deal with the data hungry effect; we not only summarize methods used for 3D semantic segmentation, but also explore methods applied in other areas to provide more references. Section VI gives a brief discussion of the data hungry effect based on our experiments and surveys. Finally, Section VII concludes the paper and introduces future work.

II 3D LiDAR Datasets and Statistical Analysis

| sensor | data type | annotation | dataset | frames | points/pixels | classes | scene* | year | first author's organization |
|---|---|---|---|---|---|---|---|---|---|
| 3D LiDAR | static | point-wise | Oakland [35] | 17 | 1.6M | 44 | o | 2009 | Carnegie Mellon University |
| 3D LiDAR | static | point-wise | Paris-rue-Madame [36] | 2 | 20M | 17 | o | 2014 | MINES ParisTech |
| 3D LiDAR | static | point-wise | TerraMobilita/IQmulus [37] | 10 | 12M | 15 | o | 2015 | University of Paris-Est |
| 3D LiDAR | static | point-wise | S3DIS [24] | 5 | 215M | 12 | i | 2016 | Stanford University |
| 3D LiDAR | static | point-wise | Semantic3D [7] | 30 | 4009M | 8 | o | 2017 | ETH Zurich |
| 3D LiDAR | static | point-wise | Paris-Lille-3D [38] | 3 | 143M | 50 | o | 2018 | MINES ParisTech |
| 3D LiDAR | sequential | point-wise | Sydney Urban [39] | 631 | / | 26 | o | 2013 | Australian Center for Field Robotics |
| 3D LiDAR | sequential | point-wise | SemanticKITTI [8] | 43552 | 4549M | 28 | o | 2019 | University of Bonn |
| 3D LiDAR | sequential | point-wise | SemanticPOSS [40] | 2988 | 216M | 14 | o | 2020 | Peking University |
| 3D LiDAR | sequential | 3D-box | KITTI [41] | 14999 | 1799M | 8 | o | 2012 | Karlsruhe Institute of Technology |
| 3D LiDAR | sequential | 3D-box | H3D [42] | 27K | / | 8 | o | 2019 | Honda Research Institute |
| 3D LiDAR | sequential | 3D-box | nuScenes [43] | 40K | 2780M | 23 | o | 2019 | nuTonomy |
| 3D LiDAR | sequential | 3D-box | Waymo [44] | 230K | 40710M | 4 | o | 2020 | Waymo LLC |
| 3D LiDAR | synthetic | point-wise | GTA-V [168] | / | / | / | o | 2018 | University of California, Berkeley |
| 3D LiDAR | synthetic | point-wise | SynthCity [46] | 75000 | 367.9M | 9 | o | 2019 | University College London |
| image/RGB-D | image | pixel-wise | PASCAL VOC [25] | 9993 | / | 20 | i/o | 2015 | University of Leeds, Microsoft |
| image/RGB-D | image | pixel-wise | Cityscapes [47] | 24998 | 52425M | 30 | o | 2016 | Daimler AG R&D, TU Darmstadt |
| image/RGB-D | RGB-D | pixel-wise | NYU-Depth V2 [48] | 1449 | 445M | 894 | i | 2012 | New York University |
| image/RGB-D | RGB-D | pixel-wise | ScanNet [49] | 2500K | 768000M | 20 | i | 2017 | Stanford University |
| image/RGB-D | RGB-D | pixel-wise | ApolloScape** [50] | 146997 | 1322973M | 25 | o | 2018 | Baidu Research |
  • i means indoor, o means outdoor

  • ApolloScape provides depth data only for static street views without moving objects.

TABLE I: 3D LiDAR datasets with comparison to representative image and RGB-D ones

Below, we review the publicly available 3D LiDAR datasets, followed by statistical analysis on three representative datasets.

II-A 3D LiDAR Datasets

According to the data acquisition methods and main applications of the systems, the 3D LiDAR datasets listed in Table I are divided into three groups: 1) Static datasets: data collected from static viewpoints by terrestrial laser scanners or by MLS (Mobile Laser Scanning) systems that capture mainly static scene objects for applications such as street view, 3D modeling, and virtual reality. 2) Sequential datasets: data collected as sequences of frames from vehicular platforms for ADAS (Advanced Driving Assistance System) or autonomous driving applications, which can be further divided into datasets with point-wise or 3D bounding box annotations. 3) Synthetic datasets: data collected in a virtual world by simulating any of the above data acquisition systems. In addition, the most popular image and RGB-D datasets are also listed in Table I for comparison.

Fig. 1: Typical 3D LiDAR acquisition systems. (a) A terrestrial laser scanner [51] collects data from static viewpoints for static datasets. (b) An MLS (mobile laser scanning) system [38] collects data of mainly static scene objects for static datasets. (c) An autonomous driving system [41] collects 3D LiDAR streams for sequential datasets. (d) A simulation system [46] for synthetic datasets.

II-A1 Static datasets

Static datasets are most commonly used for point cloud classification tasks. Their main application scenarios include robotics, augmented reality and urban planning.

As shown in Fig. 1(a), terrestrial laser scanners are usually used to collect static 3D LiDAR data from fixed viewpoints, capturing point clouds with very high angular resolution, e.g., Semantic3D [7]. MLS systems such as that in Fig. 1(b) capture sequences of LiDAR frames from a moving vehicle, e.g., Paris-Lille-3D [38]. However, the data are generally static and reconstruct a large-scale street view, and the motion of dynamic objects in the scene is not captured.

A major feature of static datasets is dense point clouds. Some semantic segmentation models were originally developed based on such data, e.g., PointNet [10]. However, due to the static nature of these datasets, they are less suitable for developing autonomous driving applications.

II-A2 Sequential datasets

Sequential datasets are most commonly used for autonomous driving tasks, such as semantic segmentation of traffic scenes and vehicle/pedestrian detection and tracking. As shown in Fig. 1(c), autonomous driving systems are exploited to capture sequences of LiDAR frames from a moving viewpoint on the street, e.g., SemanticKITTI [8]. These datasets usually contain more frames but sparser points than static datasets. For example, the Velodyne HDL-64, a popular sensor for 3D LiDAR data collection, has only 64 scan lines within its vertical FOV (Field of View), corresponding to a coarse vertical point resolution. However, benefiting from the high frame rate of the sensor, the motion of moving objects is captured at 10 Hz.

Early datasets are annotated with 3D bounding boxes for research on vehicle and pedestrian detection and tracking, e.g., KITTI [41]. Sequential datasets with point-wise annotation, e.g., SemanticKITTI [8], have appeared in recent years. Datasets with both point-wise and instance labels spawn research on 3D semantic segmentation [52] and panoptic segmentation [53].

II-A3 Synthetic datasets

The generation of real datasets is extremely expensive due to the labor intensiveness of data annotation; hence, their scales are limited. Synthetic datasets are built through computer simulation, as shown in Fig. 1(d), and can be large scale with fine but cheap annotations. The problem with using such datasets is the large gap between synthetic and real scenes. Synthetic scenes can generally be very realistic, but they lack accuracy in detail. For example, pedestrians in the GTA-V [168] dataset have RGB information with rich details, but their physical models are simplified into cylinders, and the resultant point clouds lack the necessary details of real objects. Some studies [11][54] have also shown that models trained solely on synthetic data do not perform well enough in real environments. Moreover, generating a computer graphics model of a synthetic scene is nontrivial, and reusing the limited number of such models cannot enrich datasets with more diversity.

II-A4 Comparison with Image and RGB-D Datasets

A few representative image and RGB-D datasets are listed in Table I; they have much larger scales. Whether Cityscapes [47] and ApolloScape [50] for semantic segmentation in autonomous driving scenes, or ScanNet [49] for indoor scenes, their numbers of pixels/frames far exceed those of the 3D LiDAR datasets. Although studies on images and RGB-D data still face the data hungry problem, it is more serious in the domain of 3D LiDAR datasets.

Point number and proportion (cumulative with respect to range distance):

| dataset | ≤10 m | ≤30 m | ≤50 m | ≤70 m | ≤10 m (%) | ≤30 m (%) | ≤50 m (%) | ≤70 m (%) |
|---|---|---|---|---|---|---|---|---|
| Semantic3D | 1344M | 1740M | 1861M | 1921M | 69.62 | 90.12 | 96.39 | 99.52 |
| SemanticKITTI | 1567M | 2397M | 2503M | 2505M | 62.54 | 95.71 | 99.94 | 99.99 |
| SemanticPOSS | 40M | 153M | 183M | 193M | 20.77 | 78.42 | 94.12 | 98.91 |

Voxel number and proportion (cumulative with respect to range distance):

| dataset | ≤10 m | ≤30 m | ≤50 m | ≤70 m | ≤10 m (%) | ≤30 m (%) | ≤50 m (%) | ≤70 m (%) |
|---|---|---|---|---|---|---|---|---|
| Semantic3D | 33K | 240K | 472K | 663K | 4.56 | 33.19 | 65.37 | 91.77 |
| SemanticKITTI | 30M | 120M | 161M | 162M | 18.37 | 74.39 | 99.51 | 99.92 |
| SemanticPOSS | 2M | 23M | 37M | 43M | 5.41 | 51.08 | 82.01 | 96.18 |

  • This table only considers points with labels.

TABLE II: Statistical analysis of point/voxel number and proportion with respect to range distance of the 3D LiDAR datasets
Fig. 2: Point/voxel proportion with respect to range distance (LiDAR points with valid labels are counted). The problem of unbalanced spatial distribution of LiDAR points can be alleviated by voxelization.
Fig. 3: Comparison of point/voxel proportions of a 3D LiDAR frame in Semantic3D. (a) Visualization of the 3D LiDAR frame. (b) Proportion of LiDAR points with respect to distance. (c) Proportion of LiDAR points with respect to categories. (d) Proportion of voxels with respect to distance. (e) Proportion of voxels with respect to categories. Voxels are distributed more evenly, and the category proportions of voxels match the visualized scene better.
Fig. 4: Integration of the different label definitions of datasets. Level 0: all labels of different datasets. Level 1: used in per-dataset analysis with limited label merging. Level 2: used in cross-dataset analysis, a unified label matching of all datasets.
Fig. 5: Overall analysis of each dataset. (a-c) Each dataset is visualized by a representative scene and a histogram of the scene descriptor of the whole dataset. (d-e) Per-frame average instance number of three kinds of dynamic objects. Semantic3D is absent since it describes mainly static scenes and has no instance labels.

II-B Statistical Analysis of the Datasets

Three representative datasets are selected: 1) Semantic3D [7], the largest and most popular static dataset; 2) SemanticKITTI [8], the largest and most popular sequential dataset; and 3) SemanticPOSS [40], a new dataset that describes dynamic urban scenes with abundant cars, people and riders. These datasets are analyzed statistically in terms of size and scene diversity.

II-B1 Outline of the analysis

A straightforward method for analyzing dataset size is to count the point number and proportion. Table II and Fig. 2 show such statistics with respect to the range distance of the three datasets. It can be found that LiDAR points have a much higher density at near distances. For example, Semantic3D and SemanticKITTI both have more than 60% of their LiDAR points measured within 10 m and less than 10% in [30 m, 70 m]. SemanticPOSS is slightly different, as a LiDAR sensor (Pandora [55]) with unevenly arranged scan lines is used, which has a higher resolution on horizontal LiDAR scans. Due to this arrangement, the LiDAR points within 10 m are reduced to about 20%. However, more than 78% are measured within 30 m, leaving only about 20% in [30 m, 70 m]. The spatial distribution of LiDAR points is thus very unbalanced in these datasets, which can be seen more visually in Fig. 3 using a data frame of Semantic3D. Drawing a circle at 15 m around the sensor's location, 72% of the total 24,671,679 LiDAR points fall into the circle, where most lie on a small section of natural terrain with very similar properties. This is a common phenomenon in current 3D LiDAR datasets: objects close to the LiDAR sensor are measured with a much higher point density than far objects, so directly counting point number and proportion may not be meaningful in answering our question about data hungry effects.

In this research, a resampling of LiDAR points by voxelization is conducted to obtain data of uniform spatial resolution. Tessellating the 3D space evenly into voxels and projecting the LiDAR points into them, a set of valid voxels $V=\{v\}$ is obtained, where each valid voxel contains at least one LiDAR point and is associated with a $K$-dimensional vector $p(v)$ whose $k$-th entry $p_k(v)$ denotes the proportion of LiDAR points of label $k$ in the voxel $v$. In this research, a fixed voxel size is used. By counting the number and proportion of valid voxels, curves are plotted in Fig. 2, which show more even distributions with respect to distance. Similar results can also be found in Table II and Fig. 3.
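To make this resampling step concrete, the following is a minimal NumPy sketch that counts labeled points and valid voxels within increasing range distances, as in Table II. The voxel size, distance bins and the convention that unlabeled points carry a negative label are illustrative assumptions, not necessarily the exact settings used in this analysis.

```python
import numpy as np

def range_statistics(points, labels, voxel_size=0.2, max_ranges=(10.0, 30.0, 50.0, 70.0)):
    """Count labeled LiDAR points and valid voxels within increasing range distances.

    points: (N, 3) array of x, y, z coordinates with the sensor at the origin.
    labels: (N,) integer labels; unlabeled points are assumed to be marked < 0.
    voxel_size and max_ranges are illustrative values, not the paper's exact settings.
    """
    valid = labels >= 0                               # keep only points with a label
    pts = points[valid]
    dist = np.linalg.norm(pts, axis=1)                # range distance to the sensor
    stats = {}
    for d in max_ranges:
        in_range = pts[dist <= d]
        # A valid voxel is any voxel containing at least one labeled point.
        voxel_idx = np.floor(in_range / voxel_size).astype(np.int64)
        stats[d] = {"points": in_range.shape[0],
                    "valid_voxels": len(np.unique(voxel_idx, axis=0))}
    return stats
```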

A number of measures are subsequently defined on voxels to analyze the statistics of the datasets. Category proportion is the proportion of LiDAR points or voxels having label $k$; hence, $p(v)$ is the vector of category proportions of voxel $v$. Scene descriptor $d(S)$ is a $K$-dimensional vector that characterizes a scene $S$ using category proportions: given the voxel set $V_S$ of the scene, a descriptor is generated whose $k$-th entry $d_k(S)$ is the category proportion of label $k$ over $V_S$. In this research, $V_S$ can be the voxel set of a single frame (Semantic3D), of a sequence of frames (SemanticKITTI and SemanticPOSS) or of a whole dataset. Dynamic objects such as vehicles, persons and riders have a much different nature than static objects such as buildings, trees and ground, and are of special importance for autonomous driving applications. Thus, we define the dynamic scene descriptor, a vector over the dynamic categories only, where each entry is the per-frame instance number of the corresponding category; for a multi-frame scene, instance numbers are averaged per frame. Scene diversity distance measures the difference between two scenes on their scene descriptors. To balance the magnitudes of different categories, we use the standardized scene descriptor $\tilde{d}(S)$, whose entries are the z-score standardized category proportions $\tilde{d}_k(S) = (d_k(S) - \mu_k)/\sigma_k$, where the mean $\mu_k$ and standard deviation $\sigma_k$ are calculated over all scenes of the three datasets. Given two scenes $S_1$ and $S_2$, the scene diversity distance is estimated as the distance between their standardized descriptors $\tilde{d}(S_1)$ and $\tilde{d}(S_2)$. Similarly, the dynamic scene diversity distance is defined over the standardized dynamic scene descriptors, which include only the dynamic categories. In the following analysis, standardization is conducted on the category proportions to reduce dataset bias.
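The sketch below illustrates how such descriptors and distances could be computed, assuming per-voxel label proportions are already available. Taking the mean over voxels and comparing standardized descriptors with a Euclidean norm are plausible choices made for illustration; they are not necessarily the exact formulas used in this analysis.

```python
import numpy as np

def scene_descriptor(voxel_label_props):
    """voxel_label_props: (V, K) array, row v holds the label proportions of voxel v.
    Returns a K-dim descriptor: here, the mean category proportion over all voxels."""
    return voxel_label_props.mean(axis=0)

def diversity_distances(descriptors):
    """descriptors: (S, K) array of scene descriptors, one row per scene.
    Standardize each category with a z-score over all scenes, then compare scenes
    with a Euclidean distance (the metric is an assumption for illustration)."""
    mu = descriptors.mean(axis=0)
    sigma = descriptors.std(axis=0) + 1e-12          # avoid division by zero
    z = (descriptors - mu) / sigma
    diff = z[:, None, :] - z[None, :, :]
    return np.linalg.norm(diff, axis=-1)             # (S, S) matrix of pairwise distances
```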

Each dataset has its own definition of labels/categories, and these definitions differ considerably. For comparison, some labels are merged as described in Fig. 4. The Level 2 label definition provides a consistent taxonomy for scene diversity distance analysis and inner- or cross-dataset comparison. The Level 1 label definition preserves the special characteristics of each dataset and is used in the per-dataset analysis.

Below, we first analyze each dataset in terms of its scene description, then perform cross-scene and cross-dataset comparisons to statistically evaluate the differences in scene diversity. Finally, we discuss the datasets with respect to their representation of dynamic objects.

II-B2 Semantic3D

Semantic3D contains 15 scenes in the training set. Each is a single frame that is measured using a terrestrial laser scanner from a fixed position. A scene is visualized in Fig. 5(a), with the whole-dataset scene descriptor plotted as a histogram. Compared with the other two datasets, it can be found that Semantic3D describes mainly static scenes, where ground, vegetation and buildings are the dominating categories, and the percentage of buildings is significantly higher than in the others. It has no moving objects, except for a few parked cars.

The scenes of Semantic3D are divided into three groups, i.e., urban, rural and suburban, according to the geographic location of the data measurement. As there is only one suburban scene, it is isolated from the tree structure in Fig. 6(a). Fig. 6(a) shows the scene descriptors as histograms. It can be found that scene descriptors within the same group can be very different, whereas scenes from different groups can be quite similar. Regardless of whether the scenes are in the same or different groups, the category proportions of scene objects are very diversified. For example, one railroad scene is full of natural terrain with almost no buildings, while the cathedral scenes are the opposite: full of buildings but with almost no vegetation.

Semantic3D has no moving objects such as person and rider. The category cars only includes a few parked vehicles without instance labels. Instance-based analysis, e.g., the dynamic scene descriptor, is therefore not applicable to Semantic3D, and the corresponding results are absent from Fig. 5 and Fig. 6.

Semantic3D describes very diversified scenes. It has rich static categories and dense point clouds. However, it describes static scenes with no moving object. Since each scene has only one LiDAR frame, this could create difficulty in training many deep learning methods.

Fig. 6: Per-scene analysis of each dataset. (a) Scene descriptors of 15 Semantic3D scenes, each is one frame with dense LiDAR points. (b) (Dynamic) Scene descriptors of 11 SemanticKITTI scenes, each is a sequence of LiDAR frames. (c) (Dynamic) Scene descriptors of 6 SemanticPOSS scenes, each is a sequence of LiDAR frames.
Fig. 7: Cross-dataset scene diversity distance analysis. Confusion matrices of scene (a) and dynamic scene (b) diversity distances across the scenes of the three datasets. The darker the entry, the more different the scenes. (S3D: Semantic3D, SKITTI: SemanticKITTI, SPOSS: SemanticPOSS)

Fig. 8: Inner-dataset scene diversity distance analysis. Mean and variance of scene (a) and dynamic scene (b) diversity distances of each dataset, and their comparison. (S3D: Semantic3D, SKITTI: SemanticKITTI, SPOSS: SemanticPOSS)

II-B3 SemanticKITTI

SemanticKITTI contains 11 sequences of LiDAR frames that are measured continuously from a moving vehicle on European streets. Each sequence is treated in this research as one scene; therefore, 11 scenes are analyzed. A scene is visualized in Fig. 5(b), with the whole-dataset scene descriptor plotted as a histogram. Compared with Semantic3D, it can be found that SemanticKITTI describes larger street scenes, where vegetation and ground are the two largest categories, accounting for more than 50% and 27%, respectively. The proportion of buildings is low compared with the other datasets. There are dynamic objects in the dataset; however, person, rider and bicycle/motorcycle each account for less than 0.1%.

SemanticKITTI provides instance labels of dynamic objects. The number of dynamic objects is an index of the complexity of a dynamic scene, which is analyzed by counting per-frame instance numbers in Fig. 5(d). It can be found that SemanticKITTI has a good diversity of vehicle distribution, e.g., the average vehicle instances per frame range from 0 to 32. However, persons and riders are few; few scenes have more than 8 persons or 4 riders. This result is also confirmed by the dynamic scene descriptors in Fig. 6(b). From the 11 scene descriptors in Fig. 6(b), it can be found that there are mainly two patterns among the scenes, and the distribution of category proportions is not as diverse as that of Semantic3D.

SemanticKITTI describes street scenes from a moving vehicle in 11 sequences of 43,552 LiDAR frames. The large data size makes it very helpful for training deep learning models. However, the scenes are not as diversified as Semantic3D and have a limited number of dynamic objects.

II-B4 SemanticPOSS

SemanticPOSS contains 6 sequences of LiDAR frames that were measured continuously from a moving vehicle on the campus of Peking University, China, and has the same format as SemanticKITTI. Compared with other 3D LiDAR datasets collected on structured roads or highways, SemanticPOSS describes scenes with abundant dynamic objects and mixed traffic that is not tightly regulated by rules.

Each sequence is treated as one scene; therefore, 6 scenes were analyzed. A scene is visualized in Fig. 5(c), with the whole-dataset scene descriptor plotted as a histogram, where a more dynamic street scene is described. From the scene descriptors of Fig. 6(c), it can be found that the distribution of category proportions is generally similar to that of SemanticKITTI, but both are very different from Semantic3D. One reason is that the data measurement method introduces large differences between datasets. Another is that a street scene is dominated by static objects such as buildings, vegetation and ground, which are of large scale and have much higher data proportions than dynamic objects. The data proportion of dynamic objects is minor, and their differences can easily be ignored if they are analyzed together with static objects.

The number of dynamic objects is analyzed in Fig. 5(e). Much wider distributions can be found for all three kinds of dynamic objects compared with SemanticKITTI, where the average instances per frame range from 0 to 32 for vehicles, 0 to 24 for persons and 0 to 12 for riders. The dynamic scene descriptors of Fig. 6(c) confirm these results, i.e., SemanticPOSS describes scenes populated by different kinds of dynamic objects at different levels of crowdedness. However, from the 6 scene descriptors in Fig. 6(c), it can be found that the static scenes described by SemanticPOSS are not as diverse.

SemanticPOSS describes street scenes from a moving vehicle in 6 sequences with a total of 2,988 LiDAR frames. The data size is limited for training deep learning models, but it describes scenes of rich dynamics that are insufficiently represented in other datasets.

II-B5 Cross-dataset Analysis

A confusion matrix over the scenes of all three datasets is shown in Fig. 7(a), where each value is the scene diversity distance of a pair of scenes; the whiter, the less diverse, and the darker, the more diverse. For example, the first row compares the scene diversity distances of one Semantic3D scene with all the others: some Semantic3D entries are light gray, while others are much darker. The explanation can be found in Fig. 6(a), where the former are rural church scenes and the latter are cathedral scenes. A similar confusion matrix is shown in Fig. 7(b) to analyze the dynamic scenes described by the three datasets. Here, each value is the dynamic scene diversity distance of the pair of scenes.

From an overall perspective, Semantic3D has richer inner-dataset scene diversity than the other two datasets. The diagonal blocks of the confusion matrix in Fig. 7(a) reflect the inner-dataset scene diversity, and the overall color of the Semantic3D block is darker than that of the other two blocks. With the values of scene pairs from the same dataset, three boxplots are drawn in Fig. 8(a). Semantic3D has the lowest minimum and the highest maximum, reflecting rich inner-dataset scene diversity. SemanticPOSS generally has a higher scene diversity distance than SemanticKITTI.

In general, scenes from two different datasets tend to be more diverse. The nondiagonal blocks of the confusion matrix in Fig. 7(a) reflect the cross-dataset scene diversity, and their colors are generally darker than those of the diagonal blocks. Therefore, scenes belonging to different datasets tend to be more diverse, which indicates more similarity among inner-dataset scenes and more diversity across cross-dataset scenes.

| dataset | avg. vehicle instances per frame | avg. person | avg. rider | vehicle proportion (%) | person (%) | rider (%) |
|---|---|---|---|---|---|---|
| Semantic3D | / | / | / | 1.17 | / | / |
| SemanticKITTI | 10.09 | 0.63 | 0.18 | 3.47 | 0.05 | 0.02 |
| SemanticPOSS | 15.02 | 8.29 | 2.57 | 4.28 | 1.20 | 0.29 |

TABLE III: Dynamic scene objects with instance labels of the datasets

SemanticPOSS provides the richest dynamic objects and dynamic scene diversity. Fig. 7(b) reflects the large dynamic scene difference between SemanticPOSS and the other two datasets. With the values from the same dataset, three boxplots are drawn in Fig. 8(b). More specifically, the number and proportion of dynamic objects are compared in Table III. It is obvious that the SemanticPOSS scenes are very different from the other two, which confirms the results in Fig. 6. Due to its data acquisition method, Semantic3D has only a few static vehicles and no moving categories such as person and rider; as a result, its dynamic scene diversity is fairly low in Fig. 8(b). From Table III, SemanticKITTI and SemanticPOSS both have many vehicles, with 10.09 and 15.02 average instances per frame, respectively. In addition, SemanticPOSS has many more instances of person and rider.

Due to the distinct differences in data collection, point density, category definition, and scene diversity among 3D datasets, deep learning models can hardly aggregate these datasets for training. Insufficient diversity within each dataset, combined with large differences across datasets, makes a single 3D dataset insufficient for model training and causes a data hungry effect to some degree. As discussed in Section IV, models trained on one dataset may show very poor accuracy when tested on another.

III Methods of 3D Semantic Segmentation

In this section, we provide a comprehensive and systematic review of the representative methods of 3D semantic segmentation.

III-A Traditional and Deep Learning Methods

Methods of 3D semantic segmentation have been widely studied for decades. As illustrated in Fig.9, they are divided into traditional and deep learning methods depending on feature representation and processing flow.

Traditional methods of 3D semantic segmentation often use handcrafted features to extract geometric information of points and output point labels from a classifier such as a Support Vector Machine (SVM) or Random Forest (RF).

One common pipeline of traditional methods is: 1) point cloud over-segmentation, 2) calculating a feature vector for each segment, and 3) assigning each segment a semantic label with a classifier. Anand et al. [56] constructed a graph capturing various features and contextual relations, and then used a max-margin based approach for classification. Wolf et al. [57] proposed an efficient framework that captures geometric features of each segment and labels them with a pre-trained RF and a Conditional Random Field (CRF).
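The following sketch illustrates this segment-then-classify pipeline with scikit-learn. The covariance eigenvalue features and the Random Forest settings are illustrative choices, not the exact features or classifiers of [56] or [57].

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def eigen_features(segment_points):
    """Handcrafted descriptor of one segment: covariance eigenvalue features
    (linearity, planarity, sphericity) plus simple height statistics."""
    centered = segment_points - segment_points.mean(axis=0)
    e = np.sort(np.linalg.eigvalsh(np.cov(centered.T)))[::-1] + 1e-12  # e1 >= e2 >= e3
    z = segment_points[:, 2]
    return np.array([(e[0] - e[1]) / e[0],    # linearity
                     (e[1] - e[2]) / e[0],    # planarity
                     e[2] / e[0],             # sphericity
                     z.mean(), z.max() - z.min()])

def train_segment_classifier(train_segments, train_labels):
    """train_segments: list of (Ni, 3) point arrays produced by an over-segmentation
    step (not shown); train_labels: one semantic label per segment."""
    X = np.stack([eigen_features(s) for s in train_segments])
    return RandomForestClassifier(n_estimators=100).fit(X, train_labels)
```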

The other common process is to directly design feature vectors for each point without prior over-segmentation. RF_MSSF [58] introduced a new definition of multi-scale neighborhoods to extract consistent geometric meaning. Weinmann et al. [59] proposed a framework with four independent components: neighborhood selection, feature extraction, feature selection and classification. They tried various existing approaches for each component and found the optimal combination.

Deep Learning Methods use deep neural networks to learn a feature representation and directly map input data to semantic labels through an end-to-end procedure. Recently, a number of studies on 3D LiDAR semantic segmentation have been developed using deep neural networks, which can be broadly divided into four groups according to the format of the input data, as illustrated in Fig. 9: 1) point-based methods, 2) image-based methods, 3) voxel-based methods, and 4) graph-based methods. Below, we provide a more detailed review of these groups of methods.

Fig. 9: Overview of 3D semantic segmentation methods.

III-B Point-based Methods

Point-based methods take raw point clouds directly as input and output point-wise labels. These methods can process arbitrary unstructured point clouds. The main difficulty of raw point cloud processing is how to extract local contextual features from the unstructured data.

III-B1 Point-wise Shared MLP

PointNet [10] is the pioneer of point-based deep networks for unstructured point cloud processing. It uses shared Multi-Layer Perceptrons (MLPs) to extract point-wise features and aggregates a global feature with a symmetric max pooling operation. PointNet++ [60] improved PointNet [10] by introducing multi-scale grouping of neighboring points to extract local contextual features. Its structure is shown in Fig. 10(a).
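The following PyTorch sketch shows the core shared-MLP-plus-max-pooling idea in a heavily simplified form; the layer sizes are illustrative, and it omits the transformation networks and hierarchical grouping of the published models.

```python
import torch
import torch.nn as nn

class SharedMLPSeg(nn.Module):
    """PointNet-style segmentation head: a shared MLP (1x1 convolutions) extracts
    point-wise features, max pooling aggregates a global feature, and both are
    concatenated to predict a label for every point. Dimensions are illustrative."""
    def __init__(self, num_classes, in_dim=3):
        super().__init__()
        self.point_mlp = nn.Sequential(            # applied identically to every point
            nn.Conv1d(in_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Conv1d(128 + 128, 128, 1), nn.ReLU(),
            nn.Conv1d(128, num_classes, 1),
        )

    def forward(self, xyz):                        # xyz: (B, 3, N)
        feat = self.point_mlp(xyz)                 # (B, 128, N) point-wise features
        global_feat = feat.max(dim=2, keepdim=True).values   # symmetric max pooling
        global_feat = global_feat.expand(-1, -1, xyz.shape[2])
        return self.head(torch.cat([feat, global_feat], dim=1))  # (B, C, N) logits
```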

Inspired by PointNet++ [60], many methods seek to improve local feature extraction through different definitions of 'neighbor'. Engelmann et al. [61] proposed multi-scale input blocks (MS) with ball neighbors, where local features are aggregated in ball neighborhoods of different scales. PointSIFT [62] attached importance to neighbors in all directions, aggregating local features from 8 directions to obtain a better contextual representation. Engelmann et al. [63] introduced K-nearest neighbors in the learned feature space to compute contextual information, and world-space neighbors computed by K-means for local geometry information.

The sampling approach is another improvable factor of the shared MLP architecture. Farthest Point Sampling (FPS) in PointNet++ [60], which iteratively selects the point farthest from the already selected set, is widely applied. SO-Net [64] relies on a Self-Organizing Map (SOM) [65] for sampling, which is trained with unsupervised competitive learning to model the spatial distribution of the point cloud. RandLA-Net [52] uses random sampling to reduce computational and memory costs. Yang et al. [66] proposed Point Attention Transformers (PATs) using Gumbel Subset Sampling (GSS), which is differentiable, task-agnostic and independent of the initial sampling point.
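A minimal NumPy sketch of farthest point sampling as described above (iteratively adding the point farthest from the already selected set) is given below; the fixed seed index is an illustrative simplification.

```python
import numpy as np

def farthest_point_sampling(points, num_samples):
    """Greedy FPS: start from an arbitrary point and repeatedly add the point whose
    distance to the current sample set is largest. points: (N, 3); returns indices."""
    n = points.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)
    selected[0] = 0                                 # seed point (often chosen at random)
    for i in range(1, num_samples):
        diff = points - points[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = int(np.argmax(min_dist))      # farthest from all selected points
    return selected
```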

Specifically designed layers are frequently introduced for better local feature extraction. LSANet [67] improved feature extraction by introducing a novel Local Spatial Aware (LSA) layer, which takes the spatial distribution of point clouds into account via Spatial Distribution Weights (SDWs). PyramNet [68] introduced two novel operators, the Graph Embedding Module (GEM) and the Pyramid Attention Network (PAN), resulting in more accurate local feature extraction.

III-B2 Point Convolution

Convolution is the core operation for feature extraction in 2D image semantic segmentation, but it requires ordered inputs for local contextual information extraction. Several methods contribute to constructing an ordered feature sequence from unordered 3D LiDAR data so that convolutional deep networks can be transferred to 3D LiDAR semantic segmentation. PointCNN [69] orders the K-nearest points by their spatial distance to the centers with its X-Conv operator for point convolution; the differentiability of X-Conv makes it possible to train PointCNN by back propagation. To avoid overlaps between multi-scale regions in PointNet++ [60], A-CNN [70] introduced an annular convolution applied to ordered, constrained K-nearest neighbors, which also helps to capture better geometric representations of 3D shapes. Kernel Point Convolution (KPConv) [71] can operate on point clouds without any intermediate representation, and its alterable kernel points make it more flexible than most fixed grid convolutions.

Engelmann et al. [72] demonstrated that the size of the receptive field is directly related to semantic segmentation performance. Therefore, they proposed dilated point convolution (DPC), which increases the receptive field of convolutions by sorting the nearest neighbors and keeping only every d-th point. The Point Atrous Convolution (PAC) in PointAtrousNet [73] and PointAtrousGraph [74] was also introduced to increase the receptive field and extract multi-scale local geometric features without extra parameters.
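The dilated neighbor selection behind DPC can be sketched in a few lines; the neighborhood size and dilation factor below are illustrative values, not those of [72].

```python
import numpy as np

def dilated_knn(points, query, k=16, dilation=4):
    """Return k dilated neighbors of `query`: compute k * dilation nearest neighbors,
    sort them by distance, and keep only every `dilation`-th one (DPC-style selection)."""
    d2 = np.sum((points - query) ** 2, axis=1)       # squared distances to the query
    nearest = np.argsort(d2)[: k * dilation]         # ordinary (dense) neighborhood
    return nearest[::dilation]                       # dilated subset of k indices
```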

For computing point convolutions more efficiently, tangent convolution [75][76] projected points from spherical neighborhoods to a tangent image for extracting surface geometry features. It is efficient and feasible on large-scale data with millions of points. ShellNet [77] introduced permutation invariant ShellConv using statistics from concentric spherical shells to extract representative local features, which can efficiently capture larger receptive fields with fewer layers.

III-B3 Recurrent Neural Network

Recurrent Neural Networks (RNNs) are often used to extract contextual information from a sequence. For 3D semantic segmentation, an RNN can extract spatial context by being fed feature vectors ordered in space. Engelmann et al. [61] proposed Grid (G) blocks and a Recurrent Consolidation Unit (RCU), which divide space into several grids as the network input; each grid is fed into a shared MLP unit and the RCU to extract point-wise features and local contextual features. RSNet [78] transfers unordered points into an ordered sequence of feature vectors with a slice pooling layer; an RNN then takes the sequence as input and aggregates spatial context information.

III-B4 Lattice Convolution

A sparse permutohedral lattice is suitable for processing sparse data such as point clouds. SPLATNet [79] applies the Bilateral Convolution Layer (BCL) [80] to transform between point clouds and a sparse lattice, on which convolutions can be performed efficiently. The network combines several BCLs to extract local features hierarchically. LatticeNet [81] introduced a novel slicing operator for lattice processing; unlike in SPLATNet, the splatting and slicing operations in LatticeNet are trainable, yielding a better local feature representation.

III-C Image-based Methods

Image-based methods project 3D LiDAR data onto a surface to generate 2D images as the input of deep models. These methods are usually derived from image semantic segmentation models, such as the Fully Convolutional Network (FCN) [19] and U-Net [82]. The output pixel-wise label predictions are reprojected to the original 3D LiDAR points.

III-C1 Multi-view Segmentation

A simple projection strategy is to choose several positions for taking photos of a given point cloud. Felix et al. [83] rotated a virtual camera around a fixed vertical axis to generate multi-view synthetic images, which were processed by an FCN-based multi-stream architecture; pixel-level prediction scores were summed and then reprojected to the 3D LiDAR points. Boulch et al. [84] first generated a mesh of the 3D LiDAR data and then produced images by randomly choosing virtual camera positions; finally, the multi-view segmentation labels were reprojected to the original points. For these multi-view methods, it is important to choose appropriate camera positions and projection strategies to reduce information loss.

III-C2 Range Image Segmentation

A range image is usually generated by projecting one frame of 3D LiDAR data onto a spherical surface. SqueezeSeg [11] is a typical end-to-end network for range image semantic segmentation based on SqueezeNet [85] and a CRF. SqueezeSegV2 [54], whose architecture is shown in Fig. 10(b), improved SqueezeSeg in model structure, loss function and robustness.
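The following sketch shows the commonly used spherical projection that turns a LiDAR frame into a range image; the image resolution and vertical field-of-view limits are illustrative values for a 64-line sensor, not the exact settings of SqueezeSeg.

```python
import numpy as np

def to_range_image(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    """Project one LiDAR frame (N, 3) onto an H x W range image using spherical
    coordinates. FOV limits (degrees) are illustrative values for a 64-line sensor."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-12
    yaw = np.arctan2(y, x)                            # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                          # elevation angle
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (1.0 - yaw / np.pi) * W                 # column index from azimuth
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * H   # row index from elevation
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int64)
    img = np.zeros((H, W), dtype=np.float32)          # range channel only
    img[v, u] = r                                     # later points overwrite earlier ones
    return img
```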

Range image segmentation methods are usually applied to sequential datasets, where spatial and temporal information can be incorporated. Mei et al. [86] introduced spatial constraints to obtain predictions with more region consistency. RangeNet++ [87] alleviates the problems caused by discretization and blurry CNN outputs; a GPU-accelerated post-processing step was added to recover spatially consistent results. DeepTemporalSeg [88] introduced temporal constraints based on a Bayes filter to make predictions more temporally consistent.

Some range image segmentation methods focus on real-time performance, which is essential for applications such as autonomous driving and unmanned systems. LiSeg [89] is a lightweight FCN-based model that reduces convolution filters and parameters for lower memory consumption and better computational efficiency. PointSeg [90] is a lightweight network based on SqueezeNet [85], which balances efficiency and accuracy by introducing enlargement layers and squeeze reweighting layers. Both LiSeg [89] and PointSeg [90] use dilated convolutions to aggregate multi-scale context information and accelerate computation. RIU-Net [91] is a lighter version of U-Net [82] that requires less computation time and memory.

III-D Voxel-based Methods

Voxel-based methods transfer 3D LiDAR data into voxels for a structured data representation. These methods usually take voxels as input and predict one semantic label for each voxel.

A number of voxel-based methods are based on 3D Convolutional Neural Networks (3D CNNs). Huang and You [92] proposed a typical 3D CNN-based framework for point cloud labeling. SEGCloud [93] is based on a 3D Fully Convolutional Neural Network (3D FCNN), which provides voxel-wise probabilities for each category; trilinear interpolation and a CRF then transfer the probabilities back to the raw 3D points while maintaining spatial consistency. The Fully-Convolutional Point Network (FCPN) [94] is also derived from the 3D FCNN, but its input voxel features are encoded by a simplified PointNet [10].
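As a minimal illustration of the voxelization such 3D CNN pipelines rely on, the sketch below rasterizes a point cloud into a dense occupancy grid that a 3D convolution can consume; the grid extent and resolution are illustrative assumptions.

```python
import numpy as np

def occupancy_grid(points, grid_min=(-40.0, -40.0, -3.0), voxel_size=0.5, dims=(160, 160, 16)):
    """Rasterize an (N, 3) point cloud into a dense binary occupancy grid of shape `dims`.
    Points outside the chosen extent are discarded; all values are illustrative."""
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(np.int64)
    inside = np.all((idx >= 0) & (idx < np.asarray(dims)), axis=1)
    grid = np.zeros(dims, dtype=np.float32)
    ix, iy, iz = idx[inside].T
    grid[ix, iy, iz] = 1.0                           # mark occupied voxels
    return grid                                      # e.g., a 1-channel volume for a 3D CNN
```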

It is challenging for voxel-based methods to find a proper voxel size that balances precision and computational efficiency. Some methods have contributed to reducing the computational cost of 3D convolution on sparse data while maintaining acceptable accuracy. Graham et al. [95] implemented new Sparse Convolutions (SCs) and a novel operator called Submanifold Sparse Convolution (SSC) for more efficiency. Zhang et al. [96] reduced the computational cost of 3D convolution by treating the gravitational axis as a feature channel and using 2D convolutions to process the voxelized points. ScanComplete [97] is a 3D CNN-based model with a coarse-to-fine prediction strategy that can dynamically choose the voxel size and aggregate multi-scale local features. VV-NET [98] is a kernel-based interpolated Variational Auto-Encoder (VAE); Radial Basis Functions (RBFs) and group convolution are introduced to handle sparse data efficiently. VolMap [99] is a lightweight version of U-Net [82] for real-time semantic segmentation that takes volumetric bird's-eye-view LiDAR data as input. 3DCNN-DQN-RNN [100] fuses a 3D CNN, a Deep Q-Network (DQN) and a Residual RNN for efficient 3D semantic segmentation: the 3D CNN extracts multi-scale features, the DQN localizes and segments objects, and the RNN merges the features into a robust representation.

Fig. 10: The neural network architectures of the selected methods in the experiments. (a) PointNet++ [60]. (b) SqueezeSegV2 [54]. (c) SPG [101].

III-E Graph-based Methods

Graph-based methods construct a graph from 3D LiDAR data. A vertex usually represents a point or a group of points, and edges represent adjacency relationships between vertices. Graph construction and graph convolution are the two key operations of these methods.

As shown in Fig. 10(c), the Super-Point Graph (SPG) [101] is a representative work. The network employs a PointNet [10] to encode vertex features and graph convolutions to extract contextual information. GACNet [102] proposed a novel graph convolution operation, Graph Attention Convolution (GAC). In its graph construction, each vertex represents one point, and edges are added between each vertex and its nearest neighbors. Standard graph convolution neglects structural relations between points of the same object; GAC dynamically assigns attentional weights to different adjacent points to overcome this limitation. Jiang et al. [103] proposed a hierarchically constructed graph instead of a fixed one: the graph is initialized from a coarse layer and gradually enriched along the point decoding process. Benefiting from the hierarchical graph, edge features in different layers integrate contextual information over local regions.
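As a simple illustration of the graph construction step, the sketch below builds a k-nearest-neighbor graph over raw points with SciPy; it is a generic construction for illustration, not the super-point partition of SPG or the exact scheme of GACNet.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph(points, k=16):
    """Build a directed k-NN graph: for every point, add edges to its k nearest
    neighbors (excluding itself). Returns an (N*k, 2) array of [source, target] edges."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)              # first neighbor is the point itself
    src = np.repeat(np.arange(points.shape[0]), k)
    dst = idx[:, 1:].reshape(-1)
    return np.stack([src, dst], axis=1)
```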

Experiment 1: cross-scene generalization evaluation
  • Scope: scene diversity
  • Purpose: train models on a single dataset with different scene diversity, and examine how the data hungry problem of scene diversity affects model performance.
  • Dataset: Semantic3D
  • Models: PointNet++, SPG
  • Method: three sub-datasets are made from Semantic3D: 1) urban, containing urban scenes only; 2) rural, containing rural scenes only; 3) mix, containing both rural and urban scenes. Each sub-dataset is divided randomly into two parts for training and testing. The selected models are trained and tested crosswise on these sub-datasets.
  • Labels for testing: man-made terrain, natural terrain, high vegetation, low vegetation, building, hard scape, car
  • Other details: weights of all categories are the same for training.
  • Results: Table V, Fig. 12, Fig. 11(a)

Experiment 2: cross-dataset generalization evaluation
  • Scope: scene diversity
  • Purpose: train models on different datasets, and examine how the data hungry problem of scene diversity affects model performance.
  • Datasets: SemanticKITTI, SemanticPOSS
  • Models: PointNet++, SqueezeSegV2, SPG
  • Method: three datasets are used: SemanticKITTI, SemanticPOSS, and a mixed dataset containing both SemanticKITTI and SemanticPOSS data. Similar to Experiment 1, the selected models are trained and tested crosswise on these datasets.
  • Labels for testing: person, rider, vehicle, traffic sign/pole, trunk, vegetation, fence, building, bicycle/motorcycle, ground
  • Other details: weights of all categories are the same for training; single frames of point clouds are used as input, not overlapped frames.
  • Results: Table VI, Fig. 13, Fig. 11(b)

Experiment 3: dataset size effects evaluation
  • Scope: dataset size
  • Purpose: examine how the data hungry problem of dataset size affects model performance, and whether the models are hungry for dataset size.
  • Dataset: SemanticKITTI
  • Models: PointNet++, SqueezeSegV2, SPG
  • Method: evaluate model performance using different amounts of training data. Parts of the SemanticKITTI data are used to train the models, and the mIoU of the model predictions is compared.
  • Labels for testing: person, rider, vehicle, traffic sign/pole, trunk, vegetation, fence, building, bicycle/motorcycle, ground
  • Other details: weights of all categories are the same for training; single frames of point clouds are used as input, not overlapped frames.
  • Results: Table VII, Fig. 14(a)

TABLE IV: Design of experiments.

IV Data Hungry or Not? Experiments

As addressed in the previous sections, three representative datasets, Semantic3D [7], SemanticKITTI [8] and SemanticPOSS [40], were analyzed statistically. In this section, we design three experiments to answer the following questions: How does scene diversity influence model performance? How does training dataset size influence model performance? Does the data hungry problem in scene diversity and dataset size exist for 3D LiDAR datasets?

IV-A Selected Methods in Experiments

Three representative methods are selected for the experiments: PointNet++ [60], SqueezeSegV2 [54], and SPG [101]. Each represents one type of mainstream method.

PointNet++ is a typical point-based method that takes raw point clouds as input. The architecture of PointNet++ is shown in Fig. 10(a). PointNet++ is a hierarchical encoder-decoder structure based on shared MLPs; the sampling layer, grouping layer and PointNet layer are used to learn local contextual features. Many point-based methods are derived from the PointNet++ architecture and optimize the sampling and grouping approaches to improve performance.

SqueezeSegV2 is a typical image-based method that takes range images as input. Its architecture is a convolutional neural network, as shown in Fig. 10(b). SqueezeSegV2 is chosen as a representative of CNN-based architectures, which are similar to most image-based and voxel-based methods.

SPG is a typical graph-based method that takes a super-point graph as input. A segmentation algorithm partitions the point cloud into several groups, which serve as the vertices of the graph, and edges representing contextual relationships between vertices are constructed by comparing the shape and size of adjacent point groups. The architecture of SPG is shown in Fig. 10(c). The PointNet layer and a Gated Recurrent Unit (GRU) are used to learn local contextual features and implement graph convolution.

IV-B Outline of the Experiments

For the 3D semantic segmentation task, deep learning models need to give a semantic prediction for every point of the given point cloud. To evaluate model performance, we use the Intersection over Union (IoU), given by

$$\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c} \qquad (1)$$

where $TP_c$, $FP_c$ and $FN_c$ denote the numbers of true positive, false positive and false negative predictions of category $c$. Let $C$ be the number of categories used for measurement; the mean IoU (mIoU) is defined as the arithmetic mean of the IoUs, namely,

$$\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{IoU}_c \qquad (2)$$
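A small NumPy sketch of Eqs. (1) and (2), computing per-class IoU and mIoU from point-wise predictions, is given below.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """pred, gt: (N,) integer label arrays. Returns the IoU per class and the mIoU.
    Classes absent from both prediction and ground truth yield NaN and are skipped."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:
            ious[c] = tp / denom                     # Eq. (1)
    return ious, np.nanmean(ious)                    # Eq. (2): mean over measured classes
```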

To analyze the data hungry effects of scene diversity and dataset size, three experiments, shown in Table IV, are designed.

| category | test \ train | PN++: urban | PN++: rural | PN++: mix | SPG: urban | SPG: rural | SPG: mix |
|---|---|---|---|---|---|---|---|
| man-made terrain | urban | 95.3 | 90.0 | 94.3 | 99.7 | 99.6 | 99.6 |
| man-made terrain | rural | 89.0 | 91.8 | 88.3 | 80.0 | 96.1 | 96.5 |
| man-made terrain | mix | 92.2 | 90.5 | 91.4 | 97.7 | 99.3 | 99.4 |
| natural terrain | urban | 92.1 | 68.0 | 85.6 | 93.8 | 79.3 | 84.6 |
| natural terrain | rural | 80.0 | 85.3 | 78.3 | 32.8 | 92.4 | 91.1 |
| natural terrain | mix | 82.1 | 79.3 | 79.6 | 51.2 | 88.1 | 89.1 |
| high vegetation | urban | 88.5 | 91.3 | 90.9 | 91.4 | 93.2 | 93.4 |
| high vegetation | rural | 90.2 | 93.3 | 87.4 | 11.2 | 85.9 | 34.7 |
| high vegetation | mix | 90.2 | 93.1 | 88.1 | 49.0 | 89.1 | 60.8 |
| building | urban | 96.5 | 95.9 | 97.5 | 95.3 | 87.3 | 94.1 |
| building | rural | 89.8 | 92.3 | 88.5 | 76.9 | 95.1 | 92.8 |
| building | mix | 94.0 | 94.5 | 93.9 | 90.7 | 88.9 | 93.8 |
| mIoU | urban | 69.6 | 62.0 | 68.6 | 80.5 | 68.4 | 78.9 |
| mIoU | rural | 71.9 | 78.9 | 73.5 | 38.3 | 86.6 | 71.4 |
| mIoU | mix | 71.7 | 72.1 | 71.7 | 62.6 | 75.3 | 75.7 |

TABLE V: Result of Experiment 1: cross-scene generalization evaluation.
  • IoU of some dominant categories. Column headers indicate the model and its training sub-dataset (PN++: PointNet++); rows indicate the test sub-dataset. In the original colorized table, a deeper color means better performance on a specific test scene using a model.

Fig. 11: mIoU of the models trained on different scenes (PN++: PointNet++, SqV2: SqueezeSegV2). (a) Result of Experiment 1. (b) Result of Experiment 2. Different box colors denote different training scenes, and the vertical axis shows mIoU. Each box shows the maximum, minimum and median of the mIoU over the different test sets.

| category | test \ train | PN++: SKITTI | PN++: SPOSS | PN++: mix | SqV2: SKITTI | SqV2: SPOSS | SqV2: mix | SPG: SKITTI | SPG: SPOSS | SPG: mix |
|---|---|---|---|---|---|---|---|---|---|---|
| person | SKITTI | 0.7 | 6.4 | 1.8 | 15.4 | 2.9 | 5.7 | 2.8 | 0.9 | 3.9 |
| person | SPOSS | 0.0 | 20.8 | 18.7 | 0.0 | 18.4 | 15.8 | 4.2 | 17.2 | 18.0 |
| person | mix | 0.4 | 13.6 | 10.3 | 7.7 | 10.7 | 10.8 | 3.5 | 9.1 | 11.0 |
| vehicle | SKITTI | 53.2 | 16.6 | 53.1 | 78.4 | 8.0 | 30.1 | 49.0 | 15.9 | 16.1 |
| vehicle | SPOSS | 4.1 | 8.9 | 8.8 | 1.6 | 34.9 | 15.7 | 15.3 | 11.5 | 14.3 |
| vehicle | mix | 28.7 | 12.8 | 31.0 | 40.0 | 21.5 | 22.9 | 32.2 | 13.7 | 15.2 |
| vegetation | SKITTI | 64.0 | 48.2 | 64.3 | 72.6 | 56.3 | 40.0 | 51.2 | 33.8 | 51.7 |
| vegetation | SPOSS | 46.3 | 51.2 | 50.0 | 8.9 | 0.2 | 22.7 | 49.2 | 52.8 | 52.4 |
| vegetation | mix | 55.2 | 49.7 | 57.2 | 40.8 | 28.3 | 31.4 | 50.2 | 43.3 | 52.1 |
| building | SKITTI | 61.8 | 41.5 | 62.7 | 69.7 | 47.0 | 47.1 | 38.7 | 16.2 | 33.7 |
| building | SPOSS | 18.9 | 42.7 | 36.7 | 9.2 | 9.6 | 47.0 | 34.3 | 55.3 | 51.6 |
| building | mix | 40.4 | 42.1 | 49.7 | 39.5 | 28.3 | 47.1 | 36.5 | 35.8 | 42.7 |
| ground | SKITTI | 80.9 | 57.2 | 80.8 | 88.2 | 71.3 | 66.3 | 51.3 | 36.8 | 51.4 |
| ground | SPOSS | 40.0 | 62.2 | 62.0 | 45.1 | 28.9 | 56.5 | 36.3 | 75.6 | 73.4 |
| ground | mix | 60.5 | 59.7 | 71.4 | 66.7 | 50.1 | 61.4 | 43.8 | 56.2 | 62.4 |
| mIoU | SKITTI | 30.4 | 16.4 | 30.5 | 44.8 | 5.0 | 23.5 | 17.0 | 6.0 | 16.1 |
| mIoU | SPOSS | 12.7 | 20.1 | 19.5 | 6.6 | 29.8 | 19.6 | 8.8 | 28.6 | 20.9 |
| mIoU | mix | 21.6 | 18.3 | 25.0 | 25.7 | 17.4 | 21.6 | 12.9 | 17.3 | 18.5 |

TABLE VI: Result of Experiment 2: cross-dataset generalization evaluation.
  • SKITTI denotes SemanticKITTI, SPOSS denotes SemanticPOSS; PN++ denotes PointNet++, SqV2 denotes SqueezeSegV2. Column headers indicate the model and its training dataset; rows indicate the test dataset.
  • IoU of some dominant categories. In the original colorized table, a deeper color means better performance on a specific test scene using a model.

Fig. 12: Case study of Experiment 1 with the SPG result. (HV: high vegetation, B: building). The model trained on rural scenes generally performs better on high vegetation than the model trained on urban scenes. The model trained on mixed scenes performs more robustly across all scenes.
Fig. 13: Case study of Experiment 2 with the PointNet++ result. Models trained on SemanticPOSS perform better at labeling dynamic objects such as those in boxes A, D and E, but make more mistakes on some static objects such as those in boxes B, C and F. Models trained on SemanticKITTI show the opposite behavior. When both datasets are used for training, the models perform more robustly and make fewer mistakes.

IV-C Results

The results of Experiment 1 and Experiment 2 are shown in Table V and Table VI. Because SemanticKITTI is much larger than SemanticPOSS, different weights are applied when calculating the mIoU of the mixed dataset to balance the data size bias. In Table V and Table VI, experimental performances are colorized in units of 3×3 blocks. In each 3×3 block, the best result in each row is marked with the deepest red and the worst is white; the intermediate results are colored depending on their distance to the best one. Take the bottom-left 3×3 block in Table V as an example, i.e., the mIoU of the PointNet++ models: in the first row, 69.6 indicates that the model trained on the urban set performs best on the urban test set, while the model trained on the rural set achieves the worst mIoU (62.0) on the urban test set. For both Table V and Table VI, a column view shows a specific model's performance on different test scenes, and a row view shows the performance of different models on a specific test scene.

To compare the general performance and robustness of a specific model, we use the box plots shown in Fig. 11. Each box shows the maximum, minimum and median of the mIoU on different test sets. A higher position of a box indicates that the model performs relatively better, and a shorter box indicates that the model performs relatively robustly. In addition, Fig. 12 and Fig. 13 intuitively show some prediction results on several test scenes; some notable objects in the scenes are highlighted for better comparison.

The results of Experiment 3 are shown in Table VII and Fig. 14(a). The curves in Fig. 14(a) show how the performance of the models changes with the training data size.

IV-D Findings

We will answer the question of whether there are data hungry effects for 3D semantic segmentation using deep learning methods from two aspects, scene diversity and dataset size.

IV-D1 Scene diversity

From the results of Experiment 1 and Experiment 2, we can summarize our findings as follows:

- Performance decreases occur when testing a model on scenes much different from the training scenes. As shown in Table V, all models trained on rural scenes show a performance decrease when tested on urban scenes, and vice versa. This is probably caused by the high scene diversity between urban and rural scenes. A similar phenomenon appears in the experiments with SemanticKITTI and SemanticPOSS, as shown in Table VI: apart from the mixed dataset, mIoU always decreases when the test dataset differs from the training dataset. Both Table V and Table VI show deeper colors on the diagonal of each 3×3 block, which indicates better performance on similar scenes.

- Preponderant categories are easier to distinguish. A specific example is high vegetation in Table V: models trained on rural scenes are good at classifying high vegetation because of its preponderance in rural scenes. Table VI shows a similar phenomenon. For example, SemanticPOSS has a higher density of persons than SemanticKITTI but fewer samples of other categories; as a result, models trained on SemanticPOSS achieve better performance on person but are generally weaker on other categories. Several specific model predictions are shown in Fig. 12 and Fig. 13.

- High scene diversity in a training set can improve the robustness of the model. As shown in Table V and Table VI, models trained on the mixed scenes obtain acceptable predictions regardless of the test scenes. Fig.11 summarizes the performance of different models using box plots. Obviously, the mixed model has shorter boxes, i.e., a narrower spread between minimum and maximum mIoU, showing more stable performance and better robustness. From the PointNet++ results in Fig.13, it can be seen that the mixed model tends to combine the categories each single-dataset model is adept at, i.e., static categories for SemanticKITTI and dynamic objects for SemanticPOSS.

- In summary, the data hungry problem in terms of scene diversity currently exists for 3D LiDAR datasets. A lack of specific categories or a biased category distribution is a common phenomenon among datasets. For example, Semantic3D does not have dynamic categories such as person, making it unsuitable for applications such as autonomous driving systems, and models trained on SemanticKITTI may not be suitable for environments with a high density of persons. In addition, a single dataset usually does not have enough scene diversity to obtain a well-generalized model, and training models on mixed datasets may lead to some improvement in scene diversity. However, simply merging more datasets to obtain better scene diversity faces great resistance and may even lead to performance decreases, which is further discussed in Section VI. Therefore, the data hungry problem in scene diversity remains a challenge for improving the generalization ability of 3D LiDAR semantic segmentation models.

Iv-D2 Dataset size

Category     Model          12.5%   25%    50%    75%    100%
person       PointNet++       0.3    0.3    0.4    0.5    0.7
             SPG              0.6    0.4    1.1    1.0    2.3
             SqueezeSegV2     1.3    3.0    6.9    7.5   15.4
vehicle      PointNet++      52.2   54.4   54.5   54.5   54.9
             SPG             44.9   46.2   47.0   48.8   50.1
             SqueezeSegV2    61.6   68.4   71.6   74.7   78.4
vegetation   PointNet++      63.3   64.7   64.1   62.8   64.0
             SPG             47.4   47.9   53.1   52.6   52.7
             SqueezeSegV2    60.2   66.6   70.8   71.7   72.6
building     PointNet++      59.2   61.7   61.8   60.6   61.9
             SPG             33.2   36.7   34.1   36.1   38.0
             SqueezeSegV2    57.6   64.7   68.4   69.0   69.7
ground       PointNet++      63.0   63.5   64.4   65.0   65.3
             SPG             45.0   44.9   50.7   51.9   52.2
             SqueezeSegV2    78.7   79.7   86.2   88.2   88.2
mIoU         PointNet++      19.7   21.2   21.5   21.5   21.9
             SPG             13.2   14.5   15.2   15.8   17.1
             SqueezeSegV2    24.4   28.2   32.6   34.5   37.2

TABLE VII: Result of Experiment 3 – dataset size effects evaluation. IoU of some dominant categories under different training dataset sizes; experiment on the SemanticKITTI dataset.
Fig. 14: (a) Plots of training dataset size vs. models' IoU; experiment on the SemanticKITTI dataset. (b) Plots of models' accuracy with respect to distance; experiment on the SemanticPOSS dataset.

The results of Experiment 3 illustrate the facts as follows:

- Increasing training data improves model performance. All three models show uptrends with increasing training data. This is easy to understand: additional data provide more features and information for the models.

- Different models have different sensitivities to the quantity of training data. As shown in Fig.14(a), the uptrend of SqueezeSegV2 is more significant than that of PointNet++, which suggests that the data requirements of different models differ. SqueezeSegV2 takes range images as input, which are sensitive to the LiDAR's position. Incremental LiDAR frames of a scene captured at different viewpoints may therefore provide more information for range image inputs than for point cloud inputs. As a result, the curve of PointNet++ appears to saturate with 25% of the training data, while the curve of SqueezeSegV2 maintains its growing trend.

- In summary, the data hungry problem in terms of dataset size exists for current 3D LiDAR datasets. The continuous uptrend of the mIoU-size curve indicates that the models have not reached the limit of their ability and require more data for improvement; it can be expected that the mIoU would continue increasing if more training data were used. All three models show a continuous uptrend to different degrees when training data are added. Therefore, for most deep learning models, existing datasets are not sufficient.

Fig. 15: Car instances with different point numbers in the range image view. The point number of each instance is shown in black text and the corresponding distance in red text.
Fig. 16: Point number distributions of instances in SemanticKITTI and SemanticPOSS.

Iv-D3 Instance Distance and Quality

In 3D datasets, point clouds become sparser with increasing distance to the sensor; therefore, points far away from the sensor are hard to classify correctly. As shown in Fig.14(b), the prediction accuracy of the models decreases with increasing distance, but their downtrends are different: PointNet++ and SqueezeSegV2 show an obvious downtrend, whereas SPG does not show a clear accuracy drop.

For an object, the farther it is from the sensor, the fewer points it contains and the more likely it is to be occluded. Because the features of an object with too few points are vague and confusing, it is difficult to distinguish such objects definitively, even for a human. Fig.15 shows some car instances with different point numbers in the range image view. It is difficult to recognize instances with fewer than 150 points or farther than 25 m away, and the car features become clearer as the number of points increases. Therefore, it is reasonable to use the point number as a measure of instance quality.

We calculate the point number distributions of person and vehicle instances in SemanticKITTI and SemanticPOSS, as shown in Fig.16. More than 50% of the instances contain fewer than 120 points and thus make no significant contribution to model training. Although it is inevitable for 3D LiDAR datasets to contain such instances, they do aggravate the data hungry problem in terms of data size for the 3D LiDAR semantic segmentation task.
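
As a rough sketch of how such statistics can be gathered (assuming point-wise semantic and instance labels stored as parallel arrays; the array layout and the 120-point threshold are illustrative, not the exact tooling used for Fig.16), the per-instance point counts of one frame can be computed as follows:

```python
import numpy as np

def instance_point_counts(semantic_labels, instance_ids, category_id):
    """Count LiDAR points per instance of one category in a single frame.

    semantic_labels: (N,) int array of point-wise semantic labels.
    instance_ids:    (N,) int array of point-wise instance ids (0 = no instance).
    category_id:     semantic id of the category of interest (e.g., person).

    Returns a list with the point count of every instance, which can be
    accumulated over a whole dataset to obtain a distribution like Fig. 16.
    """
    mask = (semantic_labels == category_id) & (instance_ids > 0)
    _, counts = np.unique(instance_ids[mask], return_counts=True)
    return counts.tolist()

def small_instance_ratio(all_counts, threshold=120):
    """Share of instances with fewer than `threshold` points over an accumulated list."""
    all_counts = np.asarray(all_counts)
    return float((all_counts < threshold).mean()) if all_counts.size else 0.0
```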

V Efforts to Solve Data Hungry Problems

Fig. 17: Overview of efforts to solve data hungry problems. (General methods: ideas that originate from computer vision or machine learning studies but can inspire or be generalized to solve data hungry problems of 3D LiDAR semantic segmentation.)

The data hungry problem is currently a general problem of deep learning systems[21], and large research efforts have been made toward solutions in fields such as machine learning, computer vision, intelligent vehicles, and robotics. These efforts can be broadly divided into two groups: 1) developing new methodologies that do not require a large quantity of fine annotated data and 2) developing new data annotation methods that are less human intensive. Both efforts can be further divided into two groups: 1) methods incorporating domain knowledge for 3D LiDAR data processing and 2) general-purpose methods that have proven useful on other kinds of data but have not yet been adapted to 3D LiDAR applications. Hereinafter, we refer to these two groups as 3D LiDAR methods and general methods for conciseness. Fig. 17 illustrates the state of the art of these efforts.

V-a Methodology

V-A1 Weakly and semi-supervised methods

Weakly and semi-supervised methods are widely used to address the data hungry problem in terms of data size. They aim to extract as much value as possible from weak supervision so that deep networks can be trained with a limited amount of labeled data.

General methods

Most studies on weakly and semi-supervised semantic segmentation have been conducted in the image domain. Due to the existence of large image classification datasets such as ImageNet[26], image-level annotations are easy to obtain as weak supervision for semantic segmentation[104][105]. Sometimes, image-level weak supervision is integrated with additional information for better performance, such as prior knowledge of object size[106], saliency models indicating object regions[107][108][109] and superpixels[110][111]. In addition to image-level labels, there are many other types of weak supervision with relatively low labeling costs, such as bounding boxes[104][112][113], scribbles[114] and point supervision[115]. Prior knowledge can provide general constraints on objects and help make predictions more detailed; priors of objectness[107][110], class-agnostic shape[111][116], street layout[117] or combinations of several priors[106] are all useful for weak supervision. Pseudo labeling is an intuitive choice for weak supervision: the general idea is to use model predictions to annotate unlabeled data, which is universally applicable to classification[118][119] and semantic segmentation[120][121]. In addition, there are various other ideas for weakly and semi-supervised learning, such as spatiotemporal constraints from videos[122] or optical flow[123], and supervision from other modalities, such as GPS[124][125][126] and LiDAR[127].
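
As a minimal sketch of the generic pseudo-labeling loop described above (the PyTorch-style model interface, the output shape and the confidence threshold are assumptions, not the setup of any particular cited work), one round of pseudo-label generation could look like this:

```python
import torch

def pseudo_label_batch(model, unlabeled_points, confidence_threshold=0.9):
    """Generate point-wise pseudo labels from a trained model's predictions.

    unlabeled_points: (B, N, C) tensor of unlabeled point clouds.
    Returns (pseudo_labels, mask): a label for every point and a boolean mask
    selecting only the confident ones, which are then mixed with the small
    manually labeled set for the next round of training.
    """
    model.eval()
    with torch.no_grad():
        logits = model(unlabeled_points)            # assumed shape (B, N, num_classes)
        probs = torch.softmax(logits, dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)
        mask = confidence > confidence_threshold    # keep only confident predictions
    return pseudo_labels, mask
```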

3D LiDAR methods

In 3D LiDAR semantic segmentation tasks, the available weak supervision is not as abundant as that in image-related tasks. Xu et al.[128] used point supervision and needed roughly 10× fewer labels. In addition, Mei et al.[129] proposed a rule-based classifier that integrates prior human geometric domain knowledge and automatically generates weak annotations for network pretraining.

Pseudo labeling is another common idea that can be implemented for 3D LiDAR data. Xu et al.[130] designed a PointNet-based semi-supervised framework, and spatial relationships were introduced to assist the pseudo labeling process.

Additional multimodal information is helpful when facing data hungry problems. Kim et al.[131] proposed a novel multimodal CNN architecture for season-invariant semantic segmentation that fuses images and 3D LiDAR information. They achieved a 25% IoU improvement without collecting an extremely expensive long-term dataset.

In addition, spatiotemporal constraints can help propagate weak supervision from adjacent points or LiDAR frames. Mei et al.[132] integrated spatial constraints with feature and label consistency into an objective function, which helps the model simultaneously consider intraclass compactness and interclass separability. Mei et al.[86] used spatiotemporal relations between neighboring frames to automatically generate sample constraints; by penalizing constraints with inconsistent labels, they achieved nearly fully supervised performance with only a few manual annotations. Dewan et al.[88] proposed a Bayes filter based method using knowledge from previous scans, which makes sequential predictions more temporally consistent with limited training data.

V-A2 Self-supervised methods

For most deep learning frameworks, it is common practice to pretrain on large-scale datasets such as ImageNet[26] before fine-tuning the model for a specific visual task. However, when such large-scale labeled datasets are not available, self-supervised learning methods can take the place of pretraining by introducing designed pretext tasks.

General methods

Generally, models are trained on pretext tasks to learn meaningful representations related to the target task without any human annotations. Some typical pretext tasks include context prediction[133][134], inpainting[135], colorization[136][137] and temporal correlation[138][139]. Although researchers have designed various pretext tasks, self-supervised performance still does not match that of supervised pretraining. Zhan et al.[140] attempted to close this gap by incorporating a ‘mix-and-match’ tuning stage into self-supervised learning. Noroozi et al.[141] also made attempts by decoupling the self-supervised learning pipeline, which helped models learn better from the same pretext task and transfer knowledge to improve learning.

3D LiDAR methods

Inspired by the context prediction pretext task used for image semantic segmentation, Sauder et al.[142] attempted to learn from the spatial distribution of points. They randomly rearranged voxels (parts that make up an integral object, e.g., the wheels of a car or the wings of a plane) and trained the network to predict each voxel's original position, which helps the model learn prior common sense about objects' spatial structures.

In addition to learning from pretext tasks, self-supervised methods can also be implemented for clustering points with similar semantic information or common features. Maligo and Lacroix[143] classified point clouds into a large set of categories through self-supervised Gaussian mixture models, which means annotators can simply assign semantic labels to these categories instead of point-level manual annotation.
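
A minimal sketch of this idea, assuming per-point features have already been extracted (the feature choice, the number of mixture components and the scikit-learn usage are illustrative, not the exact setup of [143]):

```python
from sklearn.mixture import GaussianMixture

def cluster_points(features, n_components=50, seed=0):
    """Group points into many fine-grained clusters without any labels.

    features: (N, D) per-point feature array (e.g., height, intensity, local
              normal components); the exact features are an assumption.

    An annotator can then assign one semantic label per cluster instead of
    labeling points individually.
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed)
    cluster_ids = gmm.fit_predict(features)   # (N,) cluster index per point
    return cluster_ids, gmm
```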

V-A3 Transfer learning

The data hungry problem is reflected not only in the demand for dataset scale; a lack of data diversity also prevents deep models from generalizing across different scenarios. Transfer learning methods are designed to apply knowledge gained in a known domain to new domains.

General methods

One common approach to transferring knowledge between domains is adversarial training. Biasetton et al.[144] proposed an unsupervised domain adaptation strategy to reduce the gap between synthetic and real data. Vu et al.[120] pushed the model's decision boundaries toward the target domain through an entropy minimization objective. Romera et al.[145] used a generative adversarial network to convert images between day and night, which alleviates the data hungry problem under diverse illumination conditions.
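
As a small sketch of the entropy-minimization idea (adapted here to point-wise predictions; the tensor shapes and the loss weight in the comment are assumptions, not the exact formulation of [120]):

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(target_logits, eps=1e-8):
    """Shannon entropy of per-point predictions on the unlabeled target domain.

    target_logits: (B, N, num_classes) raw network outputs on target-domain scans.
    Minimizing this term encourages confident predictions on target data
    without requiring any target labels.
    """
    probs = F.softmax(target_logits, dim=-1)
    entropy = -(probs * torch.log(probs + eps)).sum(dim=-1)   # (B, N)
    return entropy.mean()

# Illustrative combined objective: supervised loss on source + entropy on target.
# total_loss = ce_loss(source_logits, source_labels) + 0.01 * entropy_minimization_loss(target_logits)
```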

Another commonly used approach combines techniques from self-training. Zou et al.[121] made use of pseudo labels predicted in the target domain for alternating self-training, and spatial priors were incorporated to address the category imbalance problem.

Furthermore, Jaritz et al.[146] explored transferring knowledge between multimodal domains. Specifically, using information from the LiDAR domain, their 2D network learns the appearance of objects in both day and night, and the 3D network in turn benefits from the image domain to reduce false predictions.

3D LiDAR methods

Transfer learning can help transfer knowledge from other domains to reduce the data demand of 3D LiDAR. Wu et al.[11] attempted to obtain extra training data from GTA-V used as a LiDAR simulator. To make the synthetic data more realistic, they transferred the noise distribution of KITTI data to the synthesized data. In [54], they proposed an upgraded version addressing domain shift with three components: learned intensity rendering, geodesic correlation alignment and progressive domain calibration. With knowledge transferred from the real world, models trained on synthetic data can even outperform baselines trained on real datasets.

Beyond dataset scale, the data hungry problem may also be reflected in an imbalanced category distribution. It is usually difficult to classify nondominant categories because they appear rarely and are covered by few measurements. Abdou et al.[147] proposed a weighted self-incremental transfer learning method as a solution. They reweighted the loss function based on the proportion of each category and trained nondominant categories preferentially, which helps alleviate the data hunger of nondominant categories.
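
A minimal sketch of frequency-based loss reweighting (the inverse-log-frequency scheme below is a common generic choice, not the self-incremental weighting of [147]):

```python
import numpy as np
import torch
import torch.nn as nn

def inverse_frequency_weights(point_counts_per_class, smoothing=1.02):
    """Give nondominant categories larger loss weights.

    point_counts_per_class: array with the number of labeled points per class.
    Uses the common 1 / log(c + frequency) scheme as a stand-in for the
    general idea of reweighting by category proportion.
    """
    freq = point_counts_per_class / point_counts_per_class.sum()
    weights = 1.0 / np.log(smoothing + freq)
    return torch.tensor(weights, dtype=torch.float32)

# Usage with a standard point-wise cross-entropy loss:
# weights = inverse_frequency_weights(np.array(counts_per_class))
# criterion = nn.CrossEntropyLoss(weight=weights)
```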

V-A4 Few-shot learning

Few-shot learning aims to generalize to new tasks with only a few annotations and prior knowledge. Its extremely efficient use of annotations is expected to help with data hungry problems. To the best of our knowledge, few-shot learning has not been applied to 3D LiDAR semantic segmentation, so we only review general methods.

General methods

Few-shot learning was first introduced to semantic segmentation by Shaban et al.[148] for the one-way setting. They proposed a two-branch architecture to predict pixel-level segmentation masks given only a single image and its annotation. After that, Zhang et al.[149] and Hu et al.[150] followed the two-branch architecture design and introduced attention mechanisms for enhancement.

More generally, algorithms need to segment more than one object simultaneously in practical applications, i.e., N-way semantic segmentation. Dong and Xing[151] first formulated the N-way k-shot semantic segmentation problem. They followed the prototype theory from cognitive science[152] and extended prototypical networks[153] from few-shot classification. Wang et al.[154] also made use of learned prototypes to distinguish various semantic classes. Different from previous methods, Tian et al.[155] employed an optimization-based method that leverages a linear classifier instead of nonlinear layers for training efficiency. Bucher et al.[156] moved forward from few-shot to zero-shot semantic segmentation; the core idea is to transfer semantic similarities between linguistic identities from the text embedding space to the visual space.
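
A minimal sketch of the prototype idea behind such methods, written here for point-wise features (the masked-average-pooling formulation, the tensor shapes and the cosine-similarity matching are assumptions chosen for illustration, not the exact design of [153][154]):

```python
import torch
import torch.nn.functional as F

def masked_prototypes(support_feats, support_masks):
    """support_feats: (K, N, D) features of K support samples;
       support_masks: (K, N) binary mask of the target class.
       Returns a (D,) prototype: the masked average of the support features."""
    masked = support_feats * support_masks.unsqueeze(-1)          # zero out background points
    return masked.sum(dim=(0, 1)) / support_masks.sum().clamp(min=1)

def segment_query(query_feats, prototypes):
    """query_feats: (N, D); prototypes: (C, D), one per class (incl. background).
       Assigns each query point to the class with the most similar prototype."""
    sims = F.cosine_similarity(query_feats.unsqueeze(1),          # (N, 1, D)
                               prototypes.unsqueeze(0), dim=-1)   # (1, C, D) -> (N, C)
    return sims.argmax(dim=-1)                                    # (N,) predicted labels
```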

V-B Data Annotation

V-B1 Fully/semi-automatic annotation

The high cost of pixel/point-level human annotations is one of the most important factors causing data hungry problems. Many studies have focused on obtaining cheaper annotations by designing fully or semi-automatic annotation methods.

General methods

Fully automatic annotation often uses labels from additional sensors, such as drivable paths from GPS[124][125][126] and road marking annotations from LiDAR's intensity channel[127]. Such labels are usually treated as supplementary weak supervision.

Compared with manually annotating all pixels, some semi-automatic annotation methods attempt to obtain dense segmentation masks through simple clicks. Saleh et al.[157] generated foreground masks based on objectness priors and smoothed them with a dense CRF module. Kolesnikov et al.[158] clustered image regions encoded in a deep network's midlevel features to obtain segmentation masks. Petrovai et al.[159] proposed an annotation tool with a slider for adjusting the threshold of the region merge classifier. Alonso et al.[160] proposed an adaptive superpixel segmentation propagation method to automatically augment sparse human point annotations into dense point annotations. Mackowiak et al.[161] utilized the active learning idea, which lets the model select the most informative regions for hand labeling; they achieved 95% performance with only 17% of the labeling effort, which significantly reduces the annotation cost.
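
A minimal sketch of such uncertainty-driven region selection (entropy as the uncertainty measure and precomputed region ids are assumptions for illustration, not the exact criterion of [161]):

```python
import numpy as np

def select_regions_for_labeling(prob_maps, region_ids, budget):
    """Rank regions by mean prediction entropy and pick the most uncertain ones.

    prob_maps:  (N, num_classes) softmax probabilities for N points/pixels.
    region_ids: (N,) id of the region (e.g., superpixel/supervoxel) each point belongs to.
    budget:     number of regions the human annotator will label in this round.
    """
    entropy = -(prob_maps * np.log(prob_maps + 1e-8)).sum(axis=1)      # per-point uncertainty
    regions = np.unique(region_ids)
    region_scores = np.array([entropy[region_ids == r].mean() for r in regions])
    ranked = regions[np.argsort(-region_scores)]                       # most uncertain first
    return ranked[:budget]
```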

3D LiDAR methods

Fine annotations in 3D space are much more time consuming. Methods that attempt to obtain fully automatic annotations usually depend on prior knowledge of geometry or labels from other sensors. Li et al.[162] used a decision-tree model to integrate prior knowledge among different categories and generate initialized training labels of object segments; inspired by [163], a confidence estimation was designed for filtering mislabeled training samples. In addition to the methods based on prior knowledge, several studies[164][165][166] introduced labels from camera data to help 3D LiDAR annotation.

In addition, some semi-automatic methods attempt to make 3D LiDAR annotation easier and faster. Luo et al.[167] introduced an active learning framework incorporating neighbor-consistency priors to create a minimal manually annotated training set; as a result, only a few supervoxels need to be annotated. Maligo and Lacroix[143] clustered points into a large set of categories before manual annotation using a Gaussian mixture model for unsupervised clustering, so human annotators only need to group these categories and give them semantic labels with little effort.

V-B2 Synthetic data

With the impressive progress of computer graphics, synthetic data have become an alternative to expensive and time-consuming manual annotations. Ground truth annotations can be easily obtained while playing video games or running a simulator.

General methods

Video games are usually the first choice for synthetic data collection. Richter et al. built synthetic datasets called GTA-V[168] and VIPER[169] based on the commercial game engine Grand Theft Auto V. VIPER provides the ground truth of image semantic segmentation, instance segmentation, 3D scene layout, visual odometry and optical flow. Krahenbuhl[170] extended the data collection across three video games with more diverse scenarios.

Scenarios based on video games only provide limited freedom for customization. To overcome this drawback, SYNTHIA [171] and VEIS[172] used the Unity3D[173] development platform to design urban structures and add objects optionally.

Synthetic data provide an economical supplement against data hunger, but domain adaptation techniques are still needed when the models are applied in the real world.

3D LiDAR methods

Several attempts have been made to obtain point-level labels of 3D LiDAR data from simulators. Wang et al.[174] proposed a pipeline for automatic generation of simulated 3D LiDAR with point-level labels. It is based on CARLA[175], a simulator for autonomous driving. There are some other general synthetic datasets, such as SYNTHIA[171] and [170], which are based on video games. They are not designed for obtaining synthetic 3D LiDAR data, but the annotations of depth images may be exploited for the acquisition of 3D LiDAR data.
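
For instance, if a simulator provides dense depth images with per-pixel labels, a labeled point cloud can be recovered by pinhole back-projection. The sketch below assumes known camera intrinsics and is a generic recipe, not the pipeline of any cited work:

```python
import numpy as np

def depth_to_labeled_points(depth, labels, fx, fy, cx, cy):
    """Back-project a dense depth image into a 3D point cloud, carrying labels along.

    depth:  (H, W) metric depth per pixel.
    labels: (H, W) semantic label per pixel.
    fx, fy, cx, cy: pinhole intrinsics of the (simulated) camera.
    Returns points (M, 3) in the camera frame and their labels (M,).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)
    return points, labels[valid]
```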

In addition to acquiring annotations from fully synthetic scenes, Fang et al.[176] proposed a novel LiDAR simulator that augments real scene points with synthetic obstacles. The real background point clouds are obtained by sweeping the street view with a LiDAR scanner, and synthetic movable obstacles are then placed in the real background. An upgraded version that extracts real traffic flows is described in [177]. The combination of real static backgrounds with realistic obstacle placement learned from real traffic flows narrows the gap between synthetic data and the real world, which is inspiring for addressing data hungry problems with synthetic data.

Vi Future Works and Discussion

The "data hungry" problem is increasingly being recognized as a serious and widespread problem for 3D LiDAR semantic segmentation. However, solutions for the data hungry problem in 3D LiDAR remain largely underexplored compared with studies in computer vision and machine learning. Developing new methods that rely less on fine annotated 3D LiDAR data and developing more diversified 3D LiDAR datasets could become the two main directions to focus on. Below, we elaborate on future works in these directions, followed by a discussion of open questions, which leads to important but, until now, little studied topics.

Vi-a Methodology

Compared to the vast number of methods using visual images, studies on 3D LiDAR data are very limited in both breadth and depth, and are usually sporadic, premature and unsystematic. Some new directions, such as few-shot learning, have, to the best of our knowledge, not been attempted on 3D LiDAR data for semantic segmentation or related tasks. Below, we discuss potential future topics on these aspects.

Vi-A1 Bounding boxes

The bounding box has been used as a weak supervision signal in developing many semantic segmentation methods for visual images[112][113][107][104], and bounding boxes are actually available in many open datasets[41][42][43][44], as reviewed in Table I. Can we make use of this information in processing 3D LiDAR data? This idea for alleviating the data hungry problem in 3D LiDAR seems to be absent from the efforts made so far.

Vi-A2 Prior knowledge

Different from visual images, 3D LiDAR captures real-world data with true physical size and spatial geometry, so much prior knowledge can be used. For example, if a LiDAR sensor is mounted on a vehicle, it is easy to find an approximate elevation value of the ground surface by exploiting the prior knowledge that the vehicle is on the ground and that the geometric parameters from the sensor to the ground are calibrated. Such prior knowledge can greatly help to reduce learning costs.
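
As a toy sketch of this prior (the sensor height, tolerance and axis convention are made-up example values, not calibration results of any dataset):

```python
import numpy as np

def prelabel_ground(points, sensor_height=1.8, tolerance=0.2):
    """Mark points near the expected ground plane using the mounting-height prior.

    points:        (N, 3) LiDAR points in the sensor frame (z axis pointing up).
    sensor_height: calibrated height of the sensor above the ground, in meters
                   (1.8 m is an example value, not a real calibration).
    Returns a boolean mask of points whose elevation lies within `tolerance`
    of the expected ground level z = -sensor_height.
    """
    expected_ground_z = -sensor_height
    return np.abs(points[:, 2] - expected_ground_z) < tolerance
```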

Vi-A3 Spatial-temporal constraints

Semantic segmentation of video has been studied in the computer vision and multimedia communities[178]; however, 3D LiDAR data have mostly been processed frame by frame, even when collected sequentially from a moving vehicle. Such processing ignores temporal continuity and coherence cues, which could help save computation time, improve accuracy and reduce the need for fine annotated data by introducing new constraints into a semi-supervised training method.

Vi-A4 Self-supervised learning

To make use of the large quantity of unlabeled data, self-supervised learning sets the learning objectives by composing a pretext task to obtain supervision from the data. Such an idea is still rare in 3D LiDAR processing, whereas various pretext tasks have been designed on images to learn meaningful representations, such as context prediction[133][134], inpainting[135], and temporal correlation[138][139], which may inspire new directions in the 3D LiDAR domain.

Vi-A5 Few-shot learning

The ability to learn from few labeled samples could be the fundamental solution to the data hungry problem. Few-shot learning, as the name implies, generalizes to new tasks from a few supervisions and prior knowledge, as a human being does. Shaban et al.[148] were among the first to introduce such techniques to semantic segmentation. The idea of few-shot learning is inspiring, but the studies are still in their early stages, and to our knowledge, few-shot learning has not yet been applied to 3D LiDAR data.

Fig. 18: Overview of future works and open questions.

Vi-B Datasets

The superior performance of deep learning methods is usually established on large quantities of fine annotated data. However, current 3D LiDAR datasets are very limited in terms of both size and diversity, which is a bottleneck that restricts the studies and technical progress in this field. On the one hand, the effort of dataset generation should receive more attention and be valued more highly. On the other hand, how can data generation become more efficient and less labor intensive? More research is needed, and some potential topics are discussed below.

Vi-B1 Label transfer

Label transfer is a cheap way to obtain more 3D LiDAR annotations by borrowing labels from other modalities, such as images. The impressive performance of state-of-the-art methods in image semantic segmentation makes it possible to transfer labels from images to 3D LiDAR data[165][166][164]. However, existing methods usually project image results onto 3D LiDAR data, so both correct and incorrect annotations are transferred and the quality is not guaranteed. Error detection mechanisms are needed in this procedure, where prior knowledge, such as the size, geometry, and spatial and temporal coherence of 3D LiDAR data, could be an important cue for filtering false alarms.
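
The basic projection step behind such label transfer can be sketched as follows (the calibration matrices, the pinhole model and the -1 "unlabeled" convention are generic assumptions, not the pipeline of [164][165][166]):

```python
import numpy as np

def transfer_image_labels(points, image_labels, T_cam_lidar, K):
    """Assign each LiDAR point the label of the image pixel it projects onto.

    points:       (N, 3) LiDAR points in the sensor frame.
    image_labels: (H, W) semantic segmentation of the synchronized camera image.
    T_cam_lidar:  (4, 4) extrinsic transform from the LiDAR frame to the camera frame.
    K:            (3, 3) camera intrinsic matrix.
    Returns (N,) labels, with -1 for points outside the image or behind the camera.
    """
    h, w = image_labels.shape
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # homogeneous coordinates
    cam = (T_cam_lidar @ pts_h.T)[:3]                            # (3, N) in the camera frame
    in_front = cam[2] > 1e-6
    uvw = K @ cam
    u = np.full(points.shape[0], -1, dtype=int)
    v = np.full(points.shape[0], -1, dtype=int)
    u[in_front] = (uvw[0, in_front] / uvw[2, in_front]).astype(int)
    v[in_front] = (uvw[1, in_front] / uvw[2, in_front]).astype(int)
    labels = np.full(points.shape[0], -1, dtype=int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels[valid] = image_labels[v[valid], u[valid]]
    return labels
```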

Vi-B2 Semi-automatic annotation

Semi-automatic annotation attempts to find a balance between the efficiency and the quality of data generation. Some studies[179][180] use semi-automatic methods to accelerate the labeling of 3D bounding boxes by automatically combining the results of object tracking with manual modification by human operators. Semi-automatic annotation for point-level semantic segmentation is more complicated than for 3D bounding boxes. CRF-based methods[181][163] have been developed to utilize category consistency between neighboring points, but the cost-performance improvement is still limited. In the future, techniques such as active learning[182] and online learning[183] may further boost the semi-automatic annotation process.

Vi-B3 Data augmentation

Data augmentation is a commonly used trick for enriching data diversity and has been shown to achieve gains comparable to those of architectural improvements in both 2D image[184] and 3D LiDAR object detection tasks[185]. However, common augmentation operations such as random flip, rotation and scaling, which are useful for single objects, may break the consistency of contextual elements in semantic segmentation. Liu et al.[186] attempted to use generative adversarial networks for pixel-wise data augmentation in image semantic segmentation. To the best of our knowledge, there is no systematic research on point-wise data augmentation for 3D LiDAR semantic segmentation, so there is considerable room for growth.
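
For scene-level 3D LiDAR segmentation, global transforms that move all points (and thus all context) together are the usual safe starting point; the sketch below shows such augmentations with illustrative parameter ranges:

```python
import numpy as np

def augment_scan(points, rng=None):
    """Apply simple global augmentations to one LiDAR scan of shape (N, 3+).

    A random yaw rotation, a random flip across the x-z plane and a small global
    scaling; point-wise labels stay unchanged because every point is transformed
    consistently, so the scene context remains coherent.
    """
    rng = np.random.default_rng() if rng is None else rng
    pts = points.copy()
    theta = rng.uniform(0, 2 * np.pi)                       # random yaw angle
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    pts[:, :3] = pts[:, :3] @ rot.T
    if rng.random() < 0.5:                                  # random flip
        pts[:, 1] = -pts[:, 1]
    pts[:, :3] *= rng.uniform(0.95, 1.05)                   # small global scaling
    return pts
```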

Vi-C Open Questions

Vi-C1 How do models handle unknown data in open sets?

Currently, studies mostly design algorithms and evaluate model performance in a "closed set"[187][188], which assumes that the test set obeys the same distribution as the training set. However, practical applications in the real world pose an "open set" problem, and deep models will always be data hungry for unseen categories and new scenarios; it is impossible to have knowledge of all categories or to collect endless data for training.

A general idea from image recognition[187] formalized the open set challenge as a constrained minimization problem, which aims to balance empirical risk and open space risk and to find tight bounds for known categories in feature space. Several studies[189][190][191] followed and extended this idea, but existing methods are still insufficient for handling this threat[192]. The open set problem remains a widespread challenge, especially for safety-critical applications such as autonomous driving[193] and under adversarial attacks[194].

Few studies[188] have addressed the open set problem in semantic segmentation because of its complexity. Even without unseen categories, a scene could be unique and thus belong to the "open set" for a model. Future research is expected to identify how to address unknown scenes in open sets with limited training data, which could be key to solving data hungry problems in the future.

Vi-C2 How to evaluate and dispose of dataset bias?

Each dataset is a sampling of the real world, and an ideal dataset should be a sufficiently uniform and dense sample. However, actual datasets are inevitably biased samplings of the real world. Dataset bias is not a problem isolated to 3D LiDAR but is widespread in most domains.

To explore the reality of dataset bias, Torralba and Efros[21] designed interesting experiments on object recognition datasets. By asking people to name the dataset to which images belong, they showed that almost all datasets have a strong built-in bias. To evaluate its influence, they carried out cross-dataset generalization experiments; all tests on other datasets showed a significant decline compared with tests on the same dataset used for training.

Furthermore, Khosla et al.[195] tried to find solutions for undoing the damage of dataset bias by learning bias vectors and visual world weights. However, Tommasi et al.[196] found that many existing ad hoc learning algorithms for undoing dataset bias do not help for CNN features.

Despite several discussions[21][195][197][196], dataset bias still cannot be avoided by data collectors or removed by ad hoc algorithms. Cross-dataset validation could be one method for evaluating the influence of dataset bias, but for 3D LiDAR semantic segmentation, the various data acquisition methods and different intended applications of the datasets make it extraordinarily difficult to assemble enough datasets for cross-testing.

Dataset bias is one main factor behind the hunger for data diversity. Although there is not yet a satisfactory answer for disposing of dataset bias, we expect researchers to notice the correlation between dataset bias and data hungry problems, which may be valuable for future works.

Vi-C3 How can the semantic gap between datasets be addressed?

Category definitions differ among datasets, which could be influenced by several factors:

  • Different semantic hierarchies of the labels, such as "vehicle" versus "car, bus, truck".

  • Heterogeneous semantic contexts included in biased datasets.

  • The annotation budget limits the number of labels.

These factors lead to noteworthy semantic gaps between datasets, including different semantic granularities, different numbers of labels and different contents covered by the same label.

Training models on more datasets simultaneously is a direct idea when facing data hungry problems, but it is hindered by semantic gaps. The experimental results in [198] showed that, with direct training on more datasets, the performance of all models decreases as a general trend. Meletis and Dubbelman[199] proposed hierarchical classifiers for semantic segmentation that can be trained on multiple, heterogeneous datasets. However, the specifically designed framework and the required compatible combination of selected datasets make it difficult to generalize widely to other tasks and datasets.

Few studies have considered how to define a category list scientifically, and many questions remain, such as "Is the difference between 'bicyclist' and 'motorcyclist' recognizable from sparse 3D LiDAR data?", "Will it confuse deep models if we merge 'car' and 'bus' into 'vehicle'?", and "How many categories should be defined in a dataset?", all of which deserve more research.

Scientific guidance on semantic definition would significantly help overcome semantic gaps. For data-driven deep models, bridging semantic gaps could prevent confusing labels from misleading training while making it convenient to combine heterogeneous datasets. Ultimately, it would contribute to solving data hungry problems.

Vii Conclusion

In this paper, we focus on the data hungry problem in the domain of 3D LiDAR semantic segmentation, for which datasets and methodologies are the two main factors. First, we gathered existing open 3D LiDAR datasets and performed a detailed analysis of three representative datasets, Semantic3D, SemanticKITTI and SemanticPOSS, which revealed great differences caused by LiDAR acquisition and data diversity. Second, we reviewed current methods of 3D LiDAR semantic segmentation, especially deep learning methods, which are the first to be affected by the data hungry problem. Furthermore, to better understand the current effect of the data hungry problem, we performed extensive experimental analyses and showed its impacts along different dimensions. Finally, efforts to solve the data hungry problem were reviewed, including general methods and those focused on 3D LiDAR semantic segmentation.

The data hungry problem is a real challenge for 3D LiDAR semantic segmentation, and various potential directions have not yet been well explored. We hope our work will be instructive for further studies and discussions on the data hunger of 3D LiDAR.

References

  • [1] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., “Stanley: The robot that won the DARPA grand challenge,” Journal of field Robotics, vol. 23, no. 9, pp. 661–692, 2006.
  • [2] B. J. Patz, Y. Papelis, R. Pillat, G. Stein, and D. Harper, “A practical approach to robotic design for the DARPA urban challenge,” Journal of Field Robotics, vol. 25, no. 8, pp. 528–566, 2008.
  • [3] J. Zhang and S. Singh, “Loam: LiDAR odometry and mapping in real-time.” in Robotics: Science and Systems, vol. 2, no. 9, 2014.
  • [4] W. Hess, D. Kohler, H. Rapp, and D. Andor, “Real-time loop closure in 2D LiDAR SLAM,” in 2016 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2016, pp. 1271–1278.
  • [5] B. Li, T. Zhang, and T. Xia, “Vehicle detection from 3D LiDAR using fully convolutional network,” arXiv preprint arXiv:1608.07916, 2016.
  • [6] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915.
  • [7] T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and M. Pollefeys, “Semantic3D.net: A new large-scale point cloud classification benchmark,” arXiv preprint arXiv:1704.03847, 2017.
  • [8] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences,” in IEEE International Conference on Computer Vision, 2019, pp. 9297–9307.
  • [9] R. B. Rusu and S. Cousins, “3D is here: Point cloud library (pcl),” in 2011 IEEE international conference on robotics and automation.   IEEE, 2011, pp. 1–4.
  • [10] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [11] B. Wu, A. Wan, X. Yue, and K. Keutzer, “SqueezeSeg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3D LiDAR point cloud,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 1887–1893.
  • [12] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez, “A review on deep learning techniques applied to semantic segmentation,” arXiv preprint arXiv:1704.06857, 2017.
  • [13] H. Yu, Z. Yang, L. Tan, Y. Wang, W. Sun, M. Sun, and Y. Tang, “Methods and datasets on semantic segmentation: A review,” Neurocomputing, vol. 304, pp. 82–103, 2018.
  • [14] Y. Xie, J. Tian, and X. X. Zhu, “A review of point cloud semantic segmentation,” arXiv preprint arXiv:1908.08854, 2019.
  • [15] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun, “Deep learning for 3D point clouds: A survey,” arXiv preprint arXiv:1912.12033, 2019.
  • [16] H. Zhu, F. Meng, J. Cai, and S. Lu, “Beyond pixels: A comprehensive survey from bottom-up to semantic image segmentation and cosegmentation,” Journal of Visual Communication and Image Representation, vol. 34, pp. 12–27, 2016.
  • [17] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [18] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.
  • [19] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [20] X.-W. Chen and X. Lin, “Big data deep learning: challenges and perspectives,” IEEE access, vol. 2, pp. 514–525, 2014.
  • [21] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in CVPR 2011.   IEEE, 2011, pp. 1521–1528.
  • [22] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting unreasonable effectiveness of data in deep learning era,” in IEEE international conference on computer vision, 2017, pp. 843–852.
  • [23] G. Marcus, “Deep learning: A critical appraisal,” arXiv preprint arXiv:1801.00631, 2018.
  • [24] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese, “3D semantic parsing of large-scale indoor spaces,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1534–1543.
  • [25] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” International journal of computer vision, vol. 111, no. 1, pp. 98–136, 2015.
  • [26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  • [27] A. Nguyen and B. Le, “3D point cloud segmentation: A survey,” in 2013 6th IEEE conference on robotics, automation and mechatronics (RAM).   IEEE, 2013, pp. 225–230.
  • [28] E. Grilli, F. Menna, and F. Remondino, “A review of point clouds segmentation and classification algorithms,” The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 42, p. 339, 2017.
  • [29] F. Lateef and Y. Ruichek, “Survey on semantic segmentation using deep learning techniques,” Neurocomputing, vol. 338, pp. 321–348, 2019.
  • [30] K. Vodrahalli and A. K. Bhowmik, “3D computer vision based on machine learning with deep neural networks: A review,” Journal of the Society for Information Display, vol. 25, no. 11, pp. 676–694, 2017.
  • [31] A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris, “Deep learning advances in computer vision with 3D data: A survey,” ACM Computing Surveys (CSUR), vol. 50, no. 2, pp. 1–38, 2017.
  • [32] D. Griffiths and J. Boehm, “A review on deep learning techniques for 3D sensed data classification,” Remote Sensing, vol. 11, no. 12, p. 1499, 2019.
  • [33] S. A. Bello, S. Yu, and C. Wang, “Review: deep learning on 3D point clouds,” arXiv preprint arXiv:2001.06280, 2020.
  • [34] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, and K. Dietmayer, “Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges,” IEEE Transactions on Intelligent Transportation Systems, 2020.
  • [35] D. Munoz, J. A. Bagnell, N. Vandapel, and M. Hebert, “Contextual classification with functional max-margin markov networks,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2009, pp. 975–982.
  • [36] A. Serna, B. Marcotegui, F. Goulette, and J.-E. Deschaud, “Paris-rue-madame database: a 3D mobile laser scanner dataset for benchmarking urban detection, segmentation and classification methods,” 2014.
  • [37] B. Vallet, M. Brédif, A. Serna, B. Marcotegui, and N. Paparoditis, “Terramobilita/IQmulus urban point cloud analysis benchmark,” Computers & Graphics, vol. 49, pp. 126–133, 2015.
  • [38] X. Roynard, J.-E. Deschaud, and F. Goulette, “Paris-Lille-3D: A large and high-quality ground-truth urban point cloud dataset for automatic segmentation and classification,” The International Journal of Robotics Research, vol. 37, no. 6, pp. 545–557, 2018.
  • [39] M. De Deuge, A. Quadros, C. Hung, and B. Douillard, “Unsupervised feature learning for classification of outdoor 3D scans,” in Australasian Conference on Robitics and Automation, vol. 2, 2013, p. 1.
  • [40] Y. Pan, B. Gao, J. Mei, S. Geng, C. Li, and H. Zhao, “SemanticPOSS: A point cloud dataset with large quantity of dynamic instances,” arXiv preprint arXiv:2002.09147, 2020.
  • [41] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 3354–3361.
  • [42] A. Patil, S. Malla, H. Gang, and Y.-T. Chen, “The H3D dataset for full-surround 3D multi-object detection and tracking in crowded urban scenes,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 9552–9557.
  • [43] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset for autonomous driving,” arXiv preprint arXiv:1903.11027, 2019.
  • [44] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” arXiv preprint, 2019.
  • [45] X. Yue, B. Wu, S. A. Seshia, K. Keutzer, and A. L. Sangiovanni-Vincentelli, “A LiDAR point cloud generator: from a virtual world to autonomous driving,” in 2018 ACM on International Conference on Multimedia Retrieval.   ACM, 2018, pp. 458–464.
  • [46] D. Griffiths and J. Boehm, “SynthCity: A large scale synthetic point cloud,” arXiv preprint arXiv:1907.04758, 2019.
  • [47] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
  • [48] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in European Conference on Computer Vision.   Springer, 2012, pp. 746–760.
  • [49] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “ScanNet: Richly-annotated 3D reconstructions of indoor scenes,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5828–5839.
  • [50] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, “The ApolloScape open dataset for autonomous driving and its application,” arXiv preprint arXiv:1803.06184, 2018.
  • [51] RUDI, “3D modelling above and below ground,” 2014.
  • [52] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham, “RandLA-Net: Efficient semantic segmentation of large-scale point clouds,” arXiv preprint arXiv:1911.11236, 2019.
  • [53] Y. Jin, L. Xiangfeng, W. Yang, S. Xie, and T. Liu, “A panoramic segmentation network for point cloud,” E&ES, vol. 440, no. 3, p. 032016, 2020.
  • [54] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer, “SqueezeSegV2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 4376–4382.
  • [55] Hesaitech.com, “Pandora-hesai,” https://www.hesaitech.com/en/Pandora.
  • [56] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena, “Contextually guided semantic labeling and search for three-dimensional point clouds,” The International Journal of Robotics Research, vol. 32, no. 1, pp. 19–34, 2013.
  • [57] D. Wolf, J. Prankl, and M. Vincze, “Fast semantic segmentation of 3D point clouds using a dense crf with learned parameters,” in 2015 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2015, pp. 4867–4873.
  • [58] H. Thomas, F. Goulette, J.-E. Deschaud, and B. Marcotegui, “Semantic classification of 3D point clouds with multiscale spherical neighborhoods,” in 2018 International Conference on 3D Vision (3DV).   IEEE, 2018, pp. 390–398.
  • [59] M. Weinmann, B. Jutzi, S. Hinz, and C. Mallet, “Semantic point cloud interpretation based on optimal neighborhoods, relevant features and efficient classifiers,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 105, pp. 286–304, 2015.
  • [60] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in neural information processing systems, 2017, pp. 5099–5108.
  • [61] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe, “Exploring spatial context for 3D semantic segmentation of point clouds,” in IEEE International Conference on Computer Vision Workshops, 2017, pp. 716–724.
  • [62] M. Jiang, Y. Wu, T. Zhao, Z. Zhao, and C. Lu, “PointSIFT: A sift-like network module for 3D point cloud semantic segmentation.”
  • [63] F. Engelmann, T. Kontogianni, J. Schult, and B. Leibe, “Know what your neighbors do: 3D semantic segmentation of point clouds,” in European Conference on Computer Vision (ECCV), 2018, pp. 0–0.
  • [64] J. Li, B. M. Chen, and G. Hee Lee, “SO-Net: Self-organizing network for point cloud analysis,” in IEEE conference on computer vision and pattern recognition, 2018, pp. 9397–9406.
  • [65] T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
  • [66] J. Yang, Q. Zhang, B. Ni, L. Li, J. Liu, M. Zhou, and Q. Tian, “Modeling point clouds with self-attention and gumbel subset sampling,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3323–3332.
  • [67] L.-Z. Chen, X.-Y. Li, D.-P. Fan, M.-M. Cheng, K. Wang, and S.-P. Lu, “LSANet: Feature learning on point sets by local spatial attention,” arXiv preprint arXiv:1905.05442, 2019.
  • [68] K. Zhiheng and L. Ning, “PyramNet: Point cloud pyramid attention network and graph embedding module for classification and segmentation,” arXiv preprint arXiv:1906.03299, 2019.
  • [69] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “PointCNN: Convolution on x-transformed points,” in Advances in neural information processing systems, 2018, pp. 820–830.
  • [70] A. Komarichev, Z. Zhong, and J. Hua, “A-CNN: Annularly convolutional neural networks on point clouds,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7421–7430.
  • [71] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “KPConv: Flexible and deformable convolution for point clouds,” in IEEE International Conference on Computer Vision, 2019, pp. 6411–6420.
  • [72] F. Engelmann, T. Kontogianni, and B. Leibe, “Dilated point convolutions: On the receptive field of point convolutions,” arXiv preprint arXiv:1907.12046, 2019.
  • [73] L. Pan, P. Wang, and C.-M. Chew, “PointAtrousNet: Point atrous convolution for point cloud analysis,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4035–4041, 2019.
  • [74] L. Pan, C.-M. Chew, and G. H. Lee, “PointAtrousGraph: Deep hierarchical encoder-decoder with atrous convolution for point clouds,” arXiv preprint arXiv:1907.09798, 2019.
  • [75] M. Tatarchenko, J. Park, V. Koltun, and Q.-Y. Zhou, “Tangent convolutions for dense prediction in 3D,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3887–3896.
  • [76] Z. Zhao, M. Liu, and K. Ramani, “DAR-Net: Dynamic aggregation network for semantic scene segmentation,” arXiv preprint arXiv:1907.12022, 2019.
  • [77] Z. Zhang, B.-S. Hua, and S.-K. Yeung, “ShellNet: Efficient point cloud convolutional neural networks using concentric shells statistics,” in IEEE International Conference on Computer Vision, 2019, pp. 1607–1616.
  • [78] Q. Huang, W. Wang, and U. Neumann, “Recurrent slice networks for 3D segmentation of point clouds,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2626–2635.
  • [79] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, and J. Kautz, “SPLATNet: Sparse lattice networks for point cloud processing,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2530–2539.
  • [80] V. Jampani, M. Kiefel, and P. V. Gehler, “Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4452–4461.
  • [81] R. A. Rosu, P. Schütt, J. Quenzel, and S. Behnke, “LatticeNet: Fast point cloud segmentation using permutohedral lattices,” arXiv preprint arXiv:1912.05905, 2019.
  • [82] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [83] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg, “Deep projective 3D semantic segmentation,” in International Conference on Computer Analysis of Images and Patterns.   Springer, 2017, pp. 95–107.
  • [84] A. Boulch, B. Le Saux, and N. Audebert, “Unstructured point cloud semantic labeling using deep segmentation networks.” 3DOR, vol. 2, p. 7, 2017.
  • [85] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
  • [86] J. Mei, B. Gao, D. Xu, W. Yao, X. Zhao, and H. Zhao, “Semantic segmentation of 3D LiDAR data in dynamic scene using semi-supervised learning,” IEEE Transactions on Intelligent Transportation Systems, 2019.
  • [87] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss, “RangeNet++: Fast and accurate LiDAR semantic segmentation,” in Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2019.
  • [88] A. Dewan and W. Burgard, “DeepTemporalSeg: Temporally consistent semantic segmentation of 3D LiDAR scans,” arXiv preprint arXiv:1906.06962, 2019.
  • [89] W. Zhang, C. Zhou, J. Yang, and K. Huang, “LiSeg: Lightweight road-object semantic segmentation in 3D LiDAR scans for autonomous driving,” in 2018 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2018, pp. 1021–1026.
  • [90] Y. Wang, T. Shi, P. Yun, L. Tai, and M. Liu, “PointSeg: Real-time semantic segmentation based on 3D LiDAR point cloud,” arXiv preprint arXiv:1807.06288, 2018.
  • [91] P. Biasutti, A. Bugeau, J.-F. Aujol, and M. Brédif, “RIU-Net: Embarrassingly simple semantic segmentation of 3D LiDAR point cloud,” arXiv preprint arXiv:1905.08748, 2019.
  • [92] J. Huang and S. You, “Point cloud labeling using 3D convolutional neural network,” in 2016 23rd International Conference on Pattern Recognition (ICPR).   IEEE, 2016, pp. 2670–2675.
  • [93] L. Tchapmi, C. Choy, I. Armeni, J. Gwak, and S. Savarese, “SegCloud: Semantic segmentation of 3D point clouds,” in 2017 international conference on 3D vision (3DV).   IEEE, 2017, pp. 537–547.
  • [94] D. Rethage, J. Wald, J. Sturm, N. Navab, and F. Tombari, “Fully-convolutional point networks for large-scale point clouds,” in European Conference on Computer Vision (ECCV), 2018, pp. 596–611.
  • [95] B. Graham, M. Engelcke, and L. van der Maaten, “3D semantic segmentation with submanifold sparse convolutional networks,” in IEEE conference on computer vision and pattern recognition, 2018, pp. 9224–9232.
  • [96] C. Zhang, W. Luo, and R. Urtasun, “Efficient convolutions for real-time semantic segmentation of 3D point clouds,” in 2018 International Conference on 3D Vision (3DV).   IEEE, 2018, pp. 399–408.
  • [97] A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner, “ScanComplete: Large-scale scene completion and semantic segmentation for 3D scans,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4578–4587.
  • [98] H.-Y. Meng, L. Gao, Y.-K. Lai, and D. Manocha, “VV-Net: Voxel vae net with group convolutions for point cloud segmentation,” in IEEE International Conference on Computer Vision, 2019, pp. 8500–8508.
  • [99] H. Radi and W. Ali, “VolMap: A real-time model for semantic segmentation of a LiDAR surrounding view,” arXiv preprint arXiv:1906.11873, 2019.
  • [100] F. Liu, S. Li, L. Zhang, C. Zhou, R. Ye, Y. Wang, and J. Lu, “3DCNN-DQN-RNN: A deep reinforcement learning framework for semantic parsing of large-scale 3D point clouds,” in IEEE International Conference on Computer Vision, 2017, pp. 5678–5687.
  • [101] L. Landrieu and M. Simonovsky, “Large-scale point cloud semantic segmentation with superpoint graphs,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4558–4567.
  • [102] L. Wang, Y. Huang, Y. Hou, S. Zhang, and J. Shan, “Graph attention convolution for point cloud semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 296–10 305.
  • [103] L. Jiang, H. Zhao, S. Liu, X. Shen, C.-W. Fu, and J. Jia, “Hierarchical point-edge interaction network for point cloud semantic segmentation,” in IEEE International Conference on Computer Vision, 2019, pp. 10 433–10 441.
  • [104] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille, “Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation,” in IEEE international conference on computer vision, 2015, pp. 1742–1750.
  • [105] D. Pathak, E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional multi-class multiple instance learning,” arXiv preprint arXiv:1412.7144, 2014.
  • [106] D. Pathak, P. Krahenbuhl, and T. Darrell, “Constrained convolutional neural networks for weakly supervised segmentation,” in IEEE international conference on computer vision, 2015, pp. 1796–1804.
  • [107] S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz, and B. Schiele, “Exploiting saliency for object segmentation from image level labels,” in 2017 IEEE conference on computer vision and pattern recognition (CVPR).   IEEE, 2017, pp. 5038–5047.
  • [108] W. Shimoda and K. Yanai, “Distinct class-specific saliency maps for weakly supervised semantic segmentation,” in European Conference on Computer Vision.   Springer, 2016, pp. 218–234.
  • [109] A. Kolesnikov and C. H. Lampert, “Seed, expand and constrain: Three principles for weakly-supervised image segmentation,” in European Conference on Computer Vision.   Springer, 2016, pp. 695–711.
  • [110] P. O. Pinheiro and R. Collobert, “From image-level to pixel-level labeling with convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1713–1721.
  • [111] S. Kwak, S. Hong, and B. Han, “Weakly supervised semantic segmentation using superpixel pooling network,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [112] W. Xia, C. Domokos, J. Dong, L.-F. Cheong, and S. Yan, “Semantic segmentation without annotating segments,” in IEEE international conference on computer vision, 2013, pp. 2176–2183.
  • [113] J. Dai, K. He, and J. Sun, “BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in IEEE International Conference on Computer Vision, 2015, pp. 1635–1643.
  • [114] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, “ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3159–3167.
  • [115] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei, “What’s the point: Semantic segmentation with point supervision,” in European Conference on Computer Vision.   Springer, 2016, pp. 549–565.
  • [116] S. Hong, H. Noh, and B. Han, “Decoupled deep neural network for semi-supervised semantic segmentation,” in Advances in Neural Information Processing Systems, 2015, pp. 1495–1503.
  • [117] A. Laddha and M. Hebert, “Improving semantic scene understanding using prior information,” in Unmanned Systems Technology XVIII, vol. 9837.   International Society for Optics and Photonics, 2016, p. 98370Q.
  • [118] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel, “MixMatch: A holistic approach to semi-supervised learning,” in Advances in Neural Information Processing Systems, 2019, pp. 5050–5060.
  • [119] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Advances in Neural Information Processing Systems, 2017, pp. 1195–1204.
  • [120] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez, “ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2517–2526.
  • [121] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang, “Unsupervised domain adaptation for semantic segmentation via class-balanced self-training,” in European Conference on Computer Vision (ECCV), 2018, pp. 289–305.
  • [122] A. Hu, A. Kendall, and R. Cipolla, “Learning a spatio-temporal embedding for video instance segmentation,” arXiv preprint arXiv:1912.08969, 2019.
  • [123] Y. He, W.-C. Chiu, M. Keuper, and M. Fritz, “STD2P: RGBD semantic segmentation using spatio-temporal data-driven pooling,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4837–4846.
  • [124] D. Barnes, W. Maddern, and I. Posner, “Find your own way: Weakly-supervised segmentation of path proposals for urban autonomy,” in 2017 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2017, pp. 203–210.
  • [125] W. Zhou, S. Worrall, A. Zyner, and E. Nebot, “Automated process for incorporating drivable path into real-time semantic segmentation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 1–6.
  • [126] T. Migishima, H. Kyutoku, D. Deguchi, Y. Kawanishi, I. Ide, and H. Murase, “Scene-adaptive driving area prediction based on automatic label acquisition from driving information,” in Asian Conference on Pattern Recognition.   Springer, 2019, pp. 106–117.
  • [127] T. Bruls, W. Maddern, A. A. Morye, and P. Newman, “Mark yourself: Road marking segmentation via weakly-supervised annotations from multimodal data,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 1863–1870.
  • [128] X. Xu and G. H. Lee, “Weakly supervised semantic point cloud segmentation: Towards 10X fewer labels,” in IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [129] J. Mei and H. Zhao, “Incorporating human domain knowledge in 3D LiDAR-based semantic segmentation,” IEEE Transactions on Intelligent Vehicles, 2019.
  • [130] K. Xu, Y. Yao, K. Murasaki, S. Ando, and A. Sagata, “Semantic segmentation of sparsely annotated 3D point clouds by pseudo-labelling,” in 2019 International Conference on 3D Vision (3DV).   IEEE, 2019, pp. 463–471.
  • [131] D.-K. Kim, D. Maturana, M. Uenoyama, and S. Scherer, “Season-invariant semantic segmentation with a deep multimodal network,” in Field and service robotics.   Springer, 2018, pp. 255–270.
  • [132] J. Mei, L. Zhang, Y. Wang, Z. Zhu, and H. Ding, “Joint margin, cograph, and label constraints for semisupervised scene parsing from point clouds,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 7, pp. 3800–3813, 2018.
  • [133] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European Conference on Computer Vision.   Springer, 2016, pp. 69–84.
  • [134] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
  • [135] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
  • [136] G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a proxy task for visual understanding,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6874–6883.
  • [137] R. Zhang, P. Isola, and A. A. Efros, “Split-Brain autoencoders: Unsupervised learning by cross-channel prediction,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1058–1067.
  • [138] X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in IEEE International Conference on Computer Vision, 2015, pp. 2794–2802.
  • [139] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan, “Learning features by watching objects move,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2701–2710.
  • [140] X. Zhan, Z. Liu, P. Luo, X. Tang, and C. C. Loy, “Mix-and-match tuning for self-supervised semantic segmentation,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [141] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, “Boosting self-supervised learning via knowledge transfer,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9359–9367.
  • [142] J. Sauder and B. Sievers, “Self-supervised deep learning on point clouds by reconstructing space,” in Advances in Neural Information Processing Systems, 2019, pp. 12,942–12,952.
  • [143] A. Maligo and S. Lacroix, “Classification of outdoor 3D LiDAR data based on unsupervised Gaussian mixture models,” IEEE Transactions on Automation Science and Engineering, vol. 14, no. 1, pp. 5–16, 2016.
  • [144] M. Biasetton, U. Michieli, G. Agresti, and P. Zanuttigh, “Unsupervised domain adaptation for semantic segmentation of urban scenes,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • [145] E. Romera, L. M. Bergasa, K. Yang, J. M. Alvarez, and R. Barea, “Bridging the day and night domain gap for semantic segmentation,” in 2019 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2019, pp. 1312–1318.
  • [146] M. Jaritz, T.-H. Vu, R. de Charette, É. Wirbel, and P. Pérez, “xMUDA: Cross-modal unsupervised domain adaptation for 3D semantic segmentation,” arXiv preprint arXiv:1911.12676, 2019.
  • [147] M. Abdou, M. Elkhateeb, I. Sobh, and A. Elsallab, “End-to-end 3D-pointcloud semantic segmentation for autonomous driving,” arXiv preprint arXiv:1906.10964, 2019.
  • [148] A. Shaban, S. Bansal, Z. Liu, I. Essa, and B. Boots, “One-shot learning for semantic segmentation,” arXiv preprint arXiv:1709.03410, 2017.
  • [149] C. Zhang, G. Lin, F. Liu, R. Yao, and C. Shen, “CANet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5217–5226.
  • [150] T. Hu, P. Yang, C. Zhang, G. Yu, Y. Mu, and C. G. Snoek, “Attention-based multi-context guiding for few-shot semantic segmentation,” in AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8441–8448.
  • [151] N. Dong and E. Xing, “Few-shot semantic segmentation with prototype learning,” in British Machine Vision Conference (BMVC), 2018.
  • [152] R. L. Solso, M. K. MacLin, and O. H. MacLin, Cognitive psychology.   Pearson Education New Zealand, 2005.
  • [153] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
  • [154] K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng, “PANet: Few-shot image semantic segmentation with prototype alignment,” in IEEE International Conference on Computer Vision, 2019, pp. 9197–9206.
  • [155] P. Tian, Z. Wu, L. Qi, L. Wang, Y. Shi, and Y. Gao, “Differentiable meta-learning model for few-shot semantic segmentation,” arXiv preprint arXiv:1911.10371, 2019.
  • [156] M. Bucher, T.-H. Vu, M. Cord, and P. Pérez, “Zero-shot semantic segmentation,” arXiv preprint arXiv:1906.00817, 2019.
  • [157] F. Saleh, M. S. Aliakbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez, “Built-in foreground/background prior for weakly-supervised semantic segmentation,” in European Conference on Computer Vision.   Springer, 2016, pp. 413–432.
  • [158] A. Kolesnikov and C. H. Lampert, “Improving weakly-supervised object localization by micro-annotation,” arXiv preprint arXiv:1605.05538, 2016.
  • [159] A. Petrovai, A. D. Costea, and S. Nedevschi, “Semi-automatic image annotation of street scenes,” in 2017 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2017, pp. 448–455.
  • [160] I. Alonso and A. C. Murillo, “Semantic segmentation from sparse labeling using multi-level superpixels,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 5785–5792.
  • [161] R. Mackowiak, P. Lenz, O. Ghori, F. Diego, O. Lange, and C. Rother, “CEREALS: Cost-effective region-based active learning for semantic segmentation,” arXiv preprint arXiv:1810.09726, 2018.
  • [162] Z. Li, L. Zhang, R. Zhong, T. Fang, L. Zhang, and Z. Zhang, “Classification of urban point clouds: A robust supervised approach with automatically generating training data,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 3, pp. 1207–1220, 2016.
  • [163] H. Zhang, J. Wang, T. Fang, and L. Quan, “Joint segmentation of images and scanned point cloud in large-scale street scenes with low-annotation cost,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4763–4772, 2014.
  • [164] Z. Chen, Q. Liao, Z. Wang, Y. Liu, and M. Liu, “Image detector based automatic 3D data labeling and training for vehicle detection on point cloud,” in 2019 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2019, pp. 1408–1413.
  • [165] F. Piewak, P. Pinggera, M. Schäfer, D. Peter, B. Schwarz, N. Schneider, M. Enzweiler, D. Pfeiffer, and M. Zöllner, “Boosting LiDAR-based semantic labeling by cross-modal training data generation,” in European Conference on Computer Vision (ECCV), 2018.
  • [166] R. Varga, A. Costea, H. Florea, I. Giosan, and S. Nedevschi, “Super-sensor for 360-degree environment perception: Point cloud segmentation using image features,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2017, pp. 1–8.
  • [167] H. Luo, C. Wang, C. Wen, Z. Chen, D. Zai, Y. Yu, and J. Li, “Semantic labeling of mobile LiDAR point clouds via active learning and higher order MRF,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 7, pp. 3631–3644, 2018.
  • [168] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” in European Conference on Computer Vision.   Springer, 2016, pp. 102–118.
  • [169] S. R. Richter, Z. Hayder, and V. Koltun, “Playing for benchmarks,” in IEEE International Conference on Computer Vision, 2017, pp. 2213–2222.
  • [170] P. Krähenbühl, “Free supervision from video games,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2955–2964.
  • [171] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3234–3243.
  • [172] F. S. Saleh, M. S. Aliakbarian, M. Salzmann, L. Petersson, and J. M. Alvarez, “Effective use of synthetic data for urban scene semantic segmentation,” in European Conference on Computer Vision.   Springer, 2018, pp. 86–103.
  • [173] “Unity3D.” [Online]. Available: https://unity.com/
  • [174] F. Wang, Y. Zhuang, H. Gu, and H. Hu, “Automatic generation of synthetic LiDAR point clouds for 3-D data analysis,” IEEE Transactions on Instrumentation and Measurement, vol. 68, no. 7, pp. 2671–2673, 2019.
  • [175] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” arXiv preprint arXiv:1711.03938, 2017.
  • [176] J. Fang, D. Zhou, F. Yan, T. Zhao, F. Zhang, Y. Ma, L. Wang, and R. Yang, “Augmented LiDAR simulator for autonomous driving,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1931–1938, 2020.
  • [177] J. Fang, F. Yan, T. Zhao, F. Zhang, D. Zhou, R. Yang, Y. Ma, and L. Wang, “Simulating LiDAR point cloud for autonomous driving using real-world scenes and traffic flows,” arXiv preprint arXiv:1811.07112, 2018.
  • [178] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-Gonzalez, and J. Garcia-Rodriguez, “A survey on deep learning techniques for image and video semantic segmentation,” Applied Soft Computing, vol. 70, pp. 41–65, 2018.
  • [179] W. Zimmer, A. Rangesh, and M. Trivedi, “3D BAT: A semi-automatic, web-based 3D annotation toolbox for full-surround, multi-modal data streams,” in IEEE Intelligent Vehicles Symposium (IV), 2019, pp. 1816–1821.
  • [180] C. Plachetka, J. Rieken, and M. Maurer, “The TUBS road user dataset: A new LiDAR dataset and its application to CNN-based road user classification for automated vehicles,” in IEEE Conference on Intelligent Transportation Systems (ITSC), 2018, pp. 2623–2630.
  • [181] E. H. Lim and D. Suter, “3D terrestrial LiDAR classifications with super-voxels and multi-scale conditional random fields,” Computer-Aided Design, vol. 41, no. 10, pp. 701–710, 2009.
  • [182] H. Luo, C. Wang, C. Wen, Z. Chen, D. Zai, Y. Yu, and J. Li, “Semantic labeling of mobile LiDAR point clouds via active learning and higher order MRF,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 7, pp. 3631–3644, 2018.
  • [183] Z. Yan, T. Duckett, and N. Bellotto, “Online learning for 3D LiDAR-based human detection: experimental analysis of point cloud clustering and classification methods,” Autonomous Robots, vol. 44, no. 2, pp. 147–164, 2020.
  • [184] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol. 6, no. 1, 2019.
  • [185] S. Cheng, Z. Leng, E. D. Cubuk, B. Zoph, C. Bai, J. Ngiam, Y. Song, B. Caine, V. Vasudevan, C. Li, Q. V. Le, J. Shlens, and D. Anguelov, “Improving 3D object detection through progressive population based augmentation,” 2020.
  • [186] S. Liu, J. Zhang, Y. Chen, Y. Liu, Z. Qin, and T. Wan, “Pixel level data augmentation for semantic image segmentation using generative adversarial networks,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 1902–1906.
  • [187] W. J. Scheirer, A. De Rezende Rocha, A. Sapkota, and T. E. Boult, “Toward open set recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1757–1772, 2013.
  • [188] K. Nogueira, H. N. Oliveira, et al., “Towards open-set semantic segmentation of aerial images.”
  • [189] W. J. Scheirer, L. P. Jain, and T. E. Boult, “Probability models for open set recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2317–2324, 2014.
  • [190] A. Bendale and T. Boult, “Towards open world recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1893–1902.
  • [191] A. Bendale and T. E. Boult, “Towards open set deep networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1563–1572.
  • [192] V. Sehwag, C. Sitawarin, A. N. Bhagoji, D. Cullina, P. Mittal, L. Song, and M. Chiang, “Analyzing the robustness of open-world machine learning,” ACM Conference on Computer and Communications Security, pp. 105–116, 2019.
  • [193] M. S. Ramanagopal, C. Anderson, R. Vasudevan, and M. Johnson-Roberson, “Failing to learn: Autonomously identifying perception failures for self-driving cars,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3860–3867, 2018.
  • [194] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy (SP).   IEEE, 2017, pp. 39–57.
  • [195] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba, “Undoing the damage of dataset bias,” in European Conference on Computer Vision.   Springer, 2012, pp. 158–171.
  • [196] T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars, “A deeper look at dataset bias,” in Domain adaptation in computer vision applications.   Springer, 2017, pp. 37–55.
  • [197] I. Model and L. Shamir, “Comparison of data set bias in object recognition benchmarks,” IEEE Access, vol. 3, pp. 1953–1962, 2015.
  • [198] M. Leonardi, D. Mazzini, and R. Schettini, “Training efficient semantic segmentation CNNs on multiple datasets,” in Image Analysis and Processing – ICIAP 2019.   Cham: Springer International Publishing, 2019, pp. 303–314.
  • [199] P. Meletis and G. Dubbelman, “Training of convolutional networks on multiple heterogeneous datasets for street scene semantic segmentation,” in 2018 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2018, pp. 1045–1050.