Unsupervised Domain Adaptation in LiDAR Semantic Segmentation with Self-Supervision and Gated Adapters

07/20/2021
by Mrigank Rochan, et al., HUAWEI Technologies Co., Ltd.

In this paper, we focus on a less explored, but more realistic and complex, problem of domain adaptation in LiDAR semantic segmentation. There is a significant drop in the performance of an existing segmentation model when the training (source domain) and testing (target domain) data originate from different LiDAR sensors. To overcome this shortcoming, we propose an unsupervised domain adaptation framework that leverages unlabeled target domain data for self-supervision, coupled with an unpaired mask transfer strategy to mitigate the impact of domain shifts. Furthermore, we introduce gated adapter modules with a small number of parameters into the network to account for target domain-specific information. Experiments on both real-to-real and synthetic-to-real LiDAR semantic segmentation benchmarks demonstrate that our approach significantly improves over prior art.


1 Introduction

In autonomous driving systems, 3D semantic segmentation plays an indispensable role since it provides precise and robust perception of the surrounding environment. For 3D perception, LiDAR (Light Detection and Ranging) is a commonly used sensor that delivers accurate distance measurements of the surrounding 3D world. As a consequence, LiDAR-based perception has been receiving considerable scientific interest.

Recently, deep learning approaches have been shown to produce promising results for LiDAR semantic segmentation, where the goal is to assign a class label to each point in a 3D LiDAR point cloud. There are two main families of deep learning approaches for 3D LiDAR semantic segmentation: 1) point-based landrieu2018large ; qi2017pointnet ; qi2017pointnet++ , where the network operates directly on 3D LiDAR point clouds, and 2) projection-based cortinhal2020salsanext ; milioto2019rangenet++ ; alonso20203d , where 3D LiDAR point clouds are usually projected onto a spherical surface to generate 2D range view (RV) images that are suitable for training 2D convolutional neural networks. However, much of the success of these deep learning methods is driven by the large amount of labeled data required for supervision during training. Moreover, it is not rare to see a well-performing model yield significantly lower performance when it is trained on one dataset (source domain, e.g., SemanticKITTI behley2019semantickitti ) and tested on another dataset (target domain, e.g., nuScenes caesar2020nuscenes ). This drop occurs due to the shift in the underlying distributions of the datasets. 3D point clouds suffer from this shift mainly due to variations in the LiDAR sensors' intrinsic/extrinsic parameters, such as the number of beams (sensors with more beams generate denser point clouds) and the sensor placement (point cloud coordinates are relative to the sensor position) corralsoto_et_al_icra2021_lcp ; Alonso2020DomainAI . In the case of 2D RV images, this translates into dissimilar density (or sparsity) of the 2D RV image samples for similar objects from the two domains, which manifests as projection holes and missing lines, as depicted in Fig. 5 (a) and (b).

Figure 5: Example visualization showing the difference in sparsity between LiDAR point clouds and their 2D range view (RV) image projections. (a) and (c) show the 2D RV image projection and the point cloud, respectively, from the SemanticKITTI dataset behley2019semantickitti . (b) and (d) show the 2D RV image projection and the point cloud, respectively, from the nuScenes dataset caesar2020nuscenes .

Domain adaptation techniques could potentially help address these issues and reduce the impact of domain shift. However, most of the advanced approaches sun2016deep ; lee2019sliced ; vu2019advent ; wang2018deep to domain adaptation in semantic segmentation primarily focus on RGB images. In contrast, there is a lack of research on domain adaptation in LiDAR semantic segmentation, and work in this area is still at an early stage. Therefore, in this work, we develop a domain adaptation strategy for LiDAR semantic segmentation.

We propose an unsupervised domain adaptation (UDA) framework for LiDAR semantic segmentation. Specifically, we introduce three key modules to improve the domain adaptation performance of a model on 3D LiDAR data from different domains. First, we present a self-supervised auxiliary task that facilitates feature learning from unlabeled target domain LiDAR data. Second, we propose an unpaired mask transfer strategy to reduce the domain shift induced by the difference in sparsity between labeled source and unlabeled target LiDAR data. Finally, we introduce light-weight gated adapter modules that are inserted into the network to capture target domain-specific information. Note that, in this paper, we build these modules into a projection-based LiDAR semantic segmentation method (i.e., SalsaNext cortinhal2020salsanext ), since this method achieves state-of-the-art performance in terms of both speed and accuracy, which is crucial for autonomous driving systems. Our goal is to improve its ability to perform well on data from different domains.

In summary, the contributions of this paper are as follows: (1) We propose a novel framework for unsupervised domain adaptation (UDA) in projection-based LiDAR semantic segmentation; (2) To bridge the domain gap between a labeled source domain and an unlabeled target domain, we propose three key modules: a self-supervised auxiliary task using target data, an unpaired mask transfer mechanism between source and target data, and the gated adapter modules; and (3) We conduct extensive experiments adapting from both real-to-real and synthetic-to-real autonomous driving datasets to demonstrate the effectiveness of our approach.

2 Related Work

In this paper, we aim to perform 3D semantic segmentation, which assigns a semantic label to each point in a 3D LiDAR point cloud. Traditional methods primarily rely on hand-crafted features derived from point cloud statistics and geometric constraints xie2020linking . Recent deep learning methods for 3D LiDAR semantic segmentation achieve promising results and mainly operate in two ways. First, some approaches landrieu2018large ; qi2017pointnet ; qi2017pointnet++ work directly on 3D points by feeding the raw point cloud as input to the network. Second, other approaches alonso20203d ; zhou2018voxelnet ; cortinhal2020salsanext ; wu2018squeezeseg transform the 3D point cloud data into another representation (such as images or voxels) and use that as input. Recent methods cortinhal2020salsanext ; wu2018squeezeseg that project the LiDAR point cloud onto a 2D image view are gaining popularity since they achieve superior performance and enable direct application of standard 2D convolutions. Our semantic segmentation backbone is based on a state-of-the-art projection-based LiDAR semantic segmentation network, SalsaNext cortinhal2020salsanext , but we focus on improving its ability to adapt to 3D point cloud data from different LiDAR sensors.

A major drawback of existing methods is that they often yield inferior performance when there is a mismatch between the distributions of the training and testing data. Unsupervised Domain Adaptation (UDA) models aim to handle this issue by exposing the model to unlabeled test (target domain) data in addition to the labeled (source domain) data during training. A dominant line of work in UDA (e.g., maximum mean discrepancy long2015learning , adversarial training ganin2016domain ; tzeng2017adversarial ; hoffman2018cycada ) follows the core idea of aligning the features from the source and target domains such that they are domain-invariant. Other work minimizes the entropy vu2019advent of the output probabilities on the unlabeled target domain data. However, these prior methods focus on RGB data, whereas we focus on UDA for LiDAR point cloud data, where relatively little research has been done. Wu et al. wu2019squeezesegv2 , Qin et al. qin2019pointdan , and Jaritz et al. jaritz2020xmuda perform domain adaptation for 3D point clouds, but they do not explicitly address variations in 3D point cloud data due to differences in LiDAR sensor configurations.

Our work is also related to domain adaptation approaches that introduce residual adapter modules rebuffi2017_nips ; rebuffi2018_cvpr into an existing network to capture information about different domains through a small number of domain-specific parameters. However, their application has so far been limited to RGB images. In this paper, we study such adapters for domain adaptation on 3D point cloud data.

3 Method

Our goal is to learn an unsupervised domain adaptation (UDA) model for LiDAR semantic segmentation. In the UDA setting, we have a source domain with a set of labeled LiDAR point clouds and a target domain with a set of unlabeled LiDAR point clouds. A source point cloud consists of N_s 3D points, each annotated with a semantic label from C classes, whereas a target point cloud consists of N_t unlabeled 3D points. Note that the source and target domain point clouds are captured using different LiDAR sensors, and our aim is to predict semantic labels with high accuracy on test LiDAR points from the target domain.

We follow prior work cortinhal2020salsanext ; milioto2019rangenet++ and project each 3D LiDAR point cloud (from both the source and target domains) onto a spherical surface, which results in a 2D Range View (RV) image representation (see Fig. 5 for examples) that is suitable for standard convolution operations. We adopt this technique since prior research cortinhal2020salsanext shows that projection-based approaches (such as 2D RV) achieve higher accuracy and run significantly faster than methods that operate directly on raw 3D point clouds. Following cortinhal2020salsanext , we store the 3D point coordinates, intensity, and range values in separate RV image channels. In the end, we obtain an H×W×5 image representation for each point cloud, where H and W denote the height and width of the projected image. Note that this 2D RV image projection may contain many holes or empty pixels.
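To make the projection step concrete, the following is a minimal sketch (ours, not the authors' released code) of the RangeNet++-style spherical projection used by projection-based methods such as SalsaNext; the image size (H, W) and the vertical field of view are sensor-dependent assumptions.

import numpy as np

def project_to_range_view(points, remissions, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """points: (N, 3) xyz; remissions: (N,) intensity. Returns an (H, W, 5) RV image."""
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = abs(fov_up) + abs(fov_down)

    depth = np.linalg.norm(points, axis=1) + 1e-8        # range of each point
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    yaw, pitch = -np.arctan2(y, x), np.arcsin(z / depth)

    u = 0.5 * (yaw / np.pi + 1.0) * W                    # horizontal image coordinate
    v = (1.0 - (pitch + abs(fov_down)) / fov) * H        # vertical image coordinate
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    rv = np.zeros((H, W, 5), dtype=np.float32)           # channels: x, y, z, intensity, range
    order = np.argsort(depth)[::-1]                      # farthest first, so nearer points overwrite
    rv[v[order], u[order], :3] = points[order]
    rv[v[order], u[order], 3] = remissions[order]
    rv[v[order], u[order], 4] = depth[order]
    return rv                                            # unfilled pixels stay zero (projection holes)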

At the core of our UDA method is the introduction of three key modules: (i) a range view image completion task which is an auxiliary task for self-supervision from unlabeled target domain data, (ii) an unpaired mask transfer scheme between source and target domain data, and (iii) a gated adapter module to learn target domain-specific information. In the following, we firstly discuss each of these mechanisms in detail. Next, we present our network architecture and its training details.

3.1 Range View Image Completion

On the unlabeled target data (i.e., its RV image representation), we define a self-supervised auxiliary task, namely Range View Image Completion (RVIC). From the input RV image, we drop alternate columns and make the model predict these dropped columns as a regression problem. We find this to be a simple and empirically effective task for learning features from target data.

This auxiliary task shares the encoder parameters θ_e with the primary segmentation task, and has its own task-specific parameters θ_ga (gated adapters) and θ_aux (auxiliary decoder). Pictorially, the overall architecture is Y-shaped (see Fig. 7(a) and Sec. 3.4 for details). We use the mean squared error as the loss function L_aux for this auxiliary task.
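As an illustration, below is a minimal sketch (with assumed tensor shapes and helper names) of the RVIC column split and the masked MSE loss; following Algorithm 1, the split keeps the full image size and zeroes out complementary columns, and the auxiliary decoder is assumed to regress the dropped columns.

import torch
import torch.nn.functional as F

def split_columns(rv_image, keep_even=True):
    """rv_image: (B, C, H, W). Returns two full-size tensors with complementary columns zeroed."""
    kept, dropped = rv_image.clone(), rv_image.clone()
    if keep_even:
        kept[..., 1::2] = 0      # keep even columns, zero out odd ones
        dropped[..., 0::2] = 0   # complementary (dropped) columns to be regressed
    else:
        kept[..., 0::2] = 0
        dropped[..., 1::2] = 0
    return kept, dropped

def rvic_loss(model, rv_target, valid_mask, keep_even=True):
    """rv_target: (B, 5, H, W) target RV image; valid_mask: (B, 1, H, W) binary projection mask."""
    kept, dropped = split_columns(rv_target, keep_even)
    _, dropped_mask = split_columns(valid_mask, keep_even)
    pred = model(kept)                       # auxiliary decoder regresses the dropped columns
    # Penalize only the dropped pixels that contain a valid projection.
    return F.mse_loss(pred * dropped_mask, dropped * dropped_mask)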

3.2 Unpaired Mask Transfer

One of the main reasons for the domain discrepancy among point clouds from different LiDAR sensors is the difference in their degree of sparsity. LiDAR sensors vary in their number of beams and therefore do not capture the environment in the same way: a sensor with more beams produces much denser point clouds than a sensor with fewer beams. For example, Fig. 5 (c) shows a point cloud captured with a 64-beam LiDAR sensor, which is quite dense compared to the point cloud from a 32-beam LiDAR sensor in Fig. 5 (d). This difference in the level of sparsity is also reflected in their respective 2D RV images (Fig. 5 (a) and (b)) in the form of projection holes.

To tackle this issue, we propose an Unpaired Mask Transfer (UMT) mechanism between the source and target domains. The key idea is to match the level of sparsity of source RV images to that of the target RV images. Our UMT mechanism consists of three steps. First, we use the RVIC mechanism on the source RV image to densify it by filling its projection holes. Second, we randomly select a target RV image and generate its binary mask, where each entry indicates whether the corresponding pixel in the RV image has a value or not. Finally, we perform an elementwise product between the target binary mask and the dense source RV image. In this way, we obtain a modified source RV image that is aligned with the target RV image in terms of sparsity. Such an alignment reduces the domain difference induced by the variance in sparsity and hence allows the network to perform well across domains. Note that a source RV image can be randomly paired with any target RV image in UMT, which is why the mechanism is unpaired. We also provide the steps involved in the UMT mechanism in Algorithm 1.
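A minimal sketch of the UMT step, assuming the source RV image has already been densified by the RVIC decoder (the tensor names are ours):

import torch

@torch.no_grad()
def unpaired_mask_transfer(dense_source_rv, source_label, target_mask):
    """dense_source_rv: (5, H, W) densified source RV image; source_label: (H, W) labels;
    target_mask: (H, W) binary validity mask of a randomly chosen target RV image."""
    transferred_rv = dense_source_rv * target_mask          # broadcast over channels
    transferred_label = source_label * target_mask.long()   # keep labels only where the target has points
    return transferred_rv, transferred_label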

3.3 Gated Adapter

Figure 6: The design of the gated adapter (GA) module (in red).

In domain adaptation, there is a line of work that adapts an existing network to a new domain with the help of a small number of domain-specific parameters that are attached to the network to account for domain differences rebuffi2017_nips ; rebuffi2018_cvpr . Rebuffi et al. rebuffi2017_nips propose the residual adapter module, which consists of additional parametric convolution layers to learn domain-specific information. We extend this idea and introduce the gated adapter (GA) module. The incoming feature representation x of an unlabeled target domain sample is first transformed by a light-weight convolution operation f_φ with learnable parameters φ. Next, we introduce a gating mechanism g, a learnable scalar initialized to 0, which is responsible for weighing the convolutional adjustments made by f_φ. We express the operation within a gated adapter module as follows:

y = x + g · f_φ(x).    (1)

The main advantage of the gated adapter module is that it can be plugged into any network and has a very small number of learnable domain-specific parameters. The intuition behind the learnable gate g is to allow the network to gradually learn how much weight to assign to target domain evidence, thereby regulating the contribution of the convolutional adjustments. We visualize a gated adapter module in Fig. 6.
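A minimal PyTorch sketch of the GA module corresponding to Eq. (1); the 1×1 convolution is our assumption for the light-weight convolution f_φ:

import torch
import torch.nn as nn

class GatedAdapter(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.adapter = nn.Conv2d(channels, channels, kernel_size=1, bias=False)  # light-weight f_phi
        self.gate = nn.Parameter(torch.zeros(1))                                 # g, initialized to 0

    def forward(self, x):
        # y = x + g * f_phi(x) (Eq. (1)); with g = 0 at the start, the module is an identity mapping.
        return x + self.gate * self.adapter(x)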

3.4 Network Architecture

Figure 7: Illustration of our UDA framework. (a) An overview of our proposed UDA network architecture. The architecture consists of an encoder (Enc), the gated adapter modules (GA), a primary decoder (Dec) that predicts logits for semantic segmentation, and an auxiliary decoder (Dec_aux) responsible for self-supervision through the auxiliary task RVIC. (b) An overview of our UDA training pipeline. We also provide the training steps (for one batch iteration) of our network in Algorithm 1.

Our UDA network architecture (see Fig. 7(a)) is based on SalsaNext cortinhal2020salsanext , a state-of-the-art encoder-decoder style LiDAR semantic segmentation network. We adapt SalsaNext for UDA with a number of modifications. First, we insert a gated adapter (GA) module next to the convolution operations in each of the ResNet blocks he2016deep of its encoder (Enc). Second, we introduce an auxiliary decoder (Dec_aux) that is identical in architecture to the primary decoder (Dec) except for the number of output channels in the last convolution layer. Specifically, Dec produces an output of dimension H×W×C, where C is the number of semantic classes, whereas Dec_aux produces an output of the same dimension as the input (i.e., H×W×5). For source data, we feed the RV image directly as input to the network. For target data, we feed the RV image with alternate columns dropped in order to perform the RVIC task (see Sec. 3.1).

We denote the parameters in Enc, GA, Dec, and Dec_aux by θ_e, θ_ga, θ_d, and θ_aux, respectively. One GA module is present in each ResNet block of Enc; together, these modules aim to capture target domain-specific information.
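The sketch below (assumed wrapper names; the real SalsaNext blocks differ) illustrates how a GA module can be attached to an encoder residual block and how the adapters can be switched on for target batches and off for source batches, as done in Algorithm 1:

import torch.nn as nn

class ResBlockWithGA(nn.Module):
    def __init__(self, block, channels):
        super().__init__()
        self.block = block                   # an existing SalsaNext encoder block
        self.ga = GatedAdapter(channels)     # from the sketch in Sec. 3.3
        self.use_ga = True

    def forward(self, x):
        out = self.block(x)
        return self.ga(out) if self.use_ga else out

def set_gated_adapters(model, enabled):
    """Enable (target batches) or disable (source batches) all GA modules, as in Algorithm 1."""
    for m in model.modules():
        if isinstance(m, ResBlockWithGA):
            m.use_ga = enabled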

3.5 Training Algorithm and Optimization

Algorithm 1 describes the training procedure of our network. For simplicity, we present the algorithm for a single training batch iteration that includes a batch of labeled source data and a batch of unlabeled target data. We also visualize our training pipeline in Fig. 7(b).

On the labeled source data, we compute the weighted cross-entropy loss L_seg (similar to cortinhal2020salsanext ) on the prediction of Dec. On the unlabeled target data, we compute the auxiliary loss L_aux (see Sec. 3.1) on the prediction of Dec_aux.

The goal of our learning is to minimize the combination of these two losses and find the optimal parameters θ_e, θ_ga, θ_d, and θ_aux in Enc, GA, Dec, and Dec_aux, respectively. Therefore, we aim to solve:

min_{θ_e, θ_ga, θ_d, θ_aux}  L_seg + λ · L_aux,    (2)

where λ is a hyperparameter that controls the relative importance of the auxiliary loss L_aux.

Arguments:

  1. model: (Enc, GA, Dec, Dec_aux)

  2. image_s, image_t: 2D RV image projections (H×W×5) of the 3D point clouds for source (s) and target (t).

  3. mask_s, mask_t: 2D projection masks (H×W) indicating pixels that contain a valid projection.

  4. label_s: 2D label projection (H×W) corresponding to image_s (source only).

  5. L_seg, L_aux: fully supervised segmentation loss and self-supervised auxiliary loss, respectively.

# Step 1: Self-supervision (RVIC) step with target input
image_t, image_aux_t = split(image_t)   # split image_t into even/odd vertical lines (columns), chosen randomly
mask_t, mask_aux_t = split(mask_t)      # split mask_t in the same way
GA = True                               # enable the gated adapters
model.train()                           # set the model to training mode
comp_t = model.Dec_aux(model.Enc_and_GA(image_t))
loss_t = L_aux(comp_t, image_aux_t, mask_aux_t)
loss_t.backward()                       # backward pass through the encoder, GA modules, and Dec_aux

# Step 2: Source RV image densification by filling its holes/missing pixels, followed by UMT
model.eval()                            # set the model to evaluation mode
comp_s = model.Dec_aux(model.Enc_and_GA(image_s))   # obtain a dense source RV image from the auxiliary decoder
mask_inv_s = 1 - mask_s                 # inverted mask with missing pixels == 1
image_s[mask_inv_s] = comp_s[mask_inv_s]            # fill in the holes/missing pixels only
# Mask transfer from target to source
mask_t = mask_t + mask_aux_t            # recombine the split target masks
mask_s = mask_s * mask_t                # transfer the target mask to the source (elementwise product)
label_s = label_s * mask_s              # apply the transferred mask to the labels (elementwise product)
image_s = image_s * mask_t              # transfer the target mask to the image (elementwise product with broadcasting)

# Step 3: Full supervision with source input and its label
GA = False                              # disable the gated adapters
model.train()                           # set the model to training mode
pred_s = model.Dec(model.Enc(image_s))  # class prediction from the mask-transferred source RV image
loss_s = L_seg(pred_s, label_s, mask_s) # compute the loss on the valid pixels given by the transferred source mask
loss_s.backward()                       # backward pass through the encoder and decoder

Algorithm 1: PyTorch-style training procedure (one batch iteration) of our UDA network.

3.6 Evaluation and Post-processing

At inference time, we do not require Dec_aux. That is, we forward the RV image of a 3D point cloud from the target domain test set through the trained Enc, GA, and Dec to obtain the network prediction.

A drawback of the RV image projection representation is that multiple LiDAR points may be projected to the same image pixel. This causes problems when the RV image is projected back to the original 3D point cloud space. To cope with this issue, we follow prior work cortinhal2020salsanext ; milioto2019rangenet++ and perform kNN-based post-processing on the network output. Note that we employ this post-processing only during evaluation on the target domain samples and not during training.
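For illustration, here is a brute-force sketch of kNN-based label cleanup when mapping RV predictions back to the full point cloud; the actual post-processing in milioto2019rangenet++ uses an efficient windowed GPU implementation rather than this O(N^2) loop:

import numpy as np

def knn_postprocess(points, proj_u, proj_v, rv_pred, k=5):
    """points: (N, 3) 3D points; proj_u, proj_v: (N,) RV image coordinates of each point;
    rv_pred: (H, W) predicted class labels on the RV image. Returns (N,) per-point labels."""
    labels = rv_pred[proj_v, proj_u]                     # points sharing a pixel inherit the same label
    refined = np.empty_like(labels)
    for i in range(len(points)):
        d = np.linalg.norm(points - points[i], axis=1)   # distances to all points
        nn = np.argsort(d)[:k]                           # k nearest neighbours (including the point itself)
        votes = np.bincount(labels[nn])
        refined[i] = votes.argmax()                      # majority vote over the neighbourhood
    return refined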

4 Experiments

4.1 Settings

Datasets. We experiment with real-to-real and synthetic-to-real adaptation scenarios for LiDAR semantic segmentation. The synthetic-to-real scenario is particularly appealing since it is much easier to collect a labeled synthetic dataset than a labeled real dataset. We use nuScenes caesar2020nuscenes and SemanticKITTI behley2019semantickitti for real-to-real adaptation, and GTA-LiDAR wu2019squeezesegv2 and real KITTI geiger2012we for synthetic-to-real adaptation.

nuScenes caesar2020nuscenes is a large-scale LiDAR point cloud segmentation dataset with 28,130 and 6,019 samples in its training and validation sets, respectively. This dataset is captured using a 32-beam LiDAR sensor, as opposed to the 64-beam LiDAR used for SemanticKITTI. Officially, the dataset comes with annotations for 16 semantic categories. However, to pair with SemanticKITTI for domain adaptation, we merge {Bus, Construction-vehicle, Ambulance, Police-vehicle, Trailer} into the Other-vehicle category.

SemanticKITTI behley2019semantickitti is a dataset created using a 64-beam Velodyne HDL-64E laser scanner. Following behley2019semantickitti , we use sequences {0-7, 9-10} (19,130 scans) for training and sequence 08 (4,071 scans) for evaluation. To pair with nuScenes for domain adaptation, we first use the 11 semantic classes shared between the two datasets. Additionally, we rename and merge some of the semantic classes to match nuScenes. In particular, {Bicycle, Bicyclist} and {Motorcycle, Motorcyclist} are merged into the Bicycle and Motorcycle categories, respectively. Person and Road are renamed to Pedestrian and Drivable_surface, respectively. {Vegetation, Trunk} are merged into Vegetation, and {Building, Fence, Other-structure, Pole, Traffic-sign} are merged into Manmade. Note that SemanticKITTI scans are annotated much more densely (in points per scan) than nuScenes scans.
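The class renaming and merging described above can be summarized as a simple mapping (the dictionary form is ours; classes not listed, such as Car, Other-vehicle, Truck, Sidewalk, and Terrain, keep their names):

SEMANTICKITTI_TO_SHARED = {
    "Bicycle": "Bicycle", "Bicyclist": "Bicycle",
    "Motorcycle": "Motorcycle", "Motorcyclist": "Motorcycle",
    "Person": "Pedestrian",
    "Road": "Drivable_surface",
    "Vegetation": "Vegetation", "Trunk": "Vegetation",
    "Building": "Manmade", "Fence": "Manmade", "Other-structure": "Manmade",
    "Pole": "Manmade", "Traffic-sign": "Manmade",
}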

GTA-LiDAR wu2019squeezesegv2 is a synthetic dataset with only two categories (i.e., car and pedestrian). There are 121,087 scans in range image projection format. In our experiments, we use the last 9,087 samples for validation and the remaining for training.

KITTI geiger2012we ; wu2018squeezeseg contains 8,057 and 2,791 samples in training and testing, respectively. It has three semantic categories: Car, Pedestrian, and Cyclist. However, to pair with GTA-LiDAR, we ignore the Cyclist category which results in 7,930 samples for training and 2,653 samples for testing zhao2020epointda .

Evaluation Metric. Following prior work cortinhal2020salsanext in LiDAR semantic segmentation, we use the mean intersection over union (mIoU) as the evaluation metric, given by

mIoU = (1/C) Σ_{c=1}^{C} |P_c ∩ G_c| / |P_c ∪ G_c|,

where P_c denotes the set of points predicted as class c, G_c denotes the ground-truth point set for class c, and |·| denotes the cardinality of a set.
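A minimal sketch of computing the per-class IoU and mIoU from per-point predictions (the array names are ours):

import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: (N,) integer class labels per point. Returns (per-class IoU list, mIoU)."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))          # |P_c intersection G_c|
        union = np.sum((pred == c) | (gt == c))          # |P_c union G_c|
        ious.append(inter / union if union > 0 else float("nan"))
    return ious, float(np.nanmean(ious))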

Implementation details. Our method is implemented in PyTorch with the same hyper-parameters as SalsaNext cortinhal2020salsanext . We employ stochastic gradient descent (SGD) with a warmup scheduler as our optimizer; the initial learning rate, momentum, and weight decay follow the SalsaNext configuration. We use a fixed, dataset-specific batch size for the SemanticKITTI/nuScenes experiments and a different one for the GTA-LiDAR/KITTI experiments. We train our models on four NVIDIA Tesla V100 32GB GPUs. Note that, following the SemanticKITTI behley2019semantickitti benchmark, we ignore the background class when training our real-to-real experiments, which is not the case for the synthetic-to-real setup wu2019squeezesegv2 . We use the weighted cross-entropy error as our supervised loss function, where the weight for each class is set to the square root of the reciprocal class frequency cortinhal2020salsanext .
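A minimal sketch of this weighted cross-entropy loss with per-class weights set to the square root of the reciprocal class frequency (the helper name and the ignore index are our assumptions):

import torch
import torch.nn as nn

def build_weighted_ce(class_counts, ignore_index=255):
    """class_counts: (C,) tensor with the number of labeled points per class in the source training set."""
    freq = class_counts.float() / class_counts.sum()
    weights = torch.sqrt(1.0 / (freq + 1e-8))            # square root of the reciprocal class frequency
    return nn.CrossEntropyLoss(weight=weights, ignore_index=ignore_index)

# Usage: loss_fn = build_weighted_ce(counts); loss = loss_fn(logits, labels)
# with logits of shape (B, C, H, W) and labels of shape (B, H, W).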

4.2 Main Results for UDA

We examine the domain adaptation ability of our method on the SemanticKITTI, nuScenes, GTA-LiDAR, and KITTI datasets. Since there is a lack of research on projection-based domain adaptation for 3D LiDAR point clouds, we compare with state-of-the-art UDA methods for 2D semantic segmentation. Specifically, we compare with CORAL sun2016deep , AdvEnt vu2019advent , and SWD lee2019sliced . Additionally, we compare with a baseline (which we refer to as Naive) where we directly evaluate the model pre-trained on the source domain on the target domain. For a fair comparison, we re-implement the baseline methods with SalsaNext cortinhal2020salsanext as their backbone network. We report the results in Tables 1 and 2.

From Tables 1 and 2, we see a huge performance drop when we directly test a network trained on one domain on another (i.e., the Naive method). This highlights the importance of domain adaptation in LiDAR semantic segmentation. Our method substantially improves the performance and outperforms the prior methods. We observe performance gains on both the real-to-real (SemanticKITTI to nuScenes and vice versa) and synthetic-to-real (GTA-LiDAR to KITTI) settings.

Fig. 8 shows a few qualitative results when adapting from nuScenes to SemanticKITTI and synthetic GTA-LiDAR to KITTI. We can see the improvement in the prediction from our UDA method in comparison to the baseline method.

Source     Target     Method                  Car   Bicycle  Motorcycle  Other_vehicle  Pedestrian  Truck  Drivable_surface  Sidewalk  Terrain  Vegetation  Manmade  mIoU
nuScenes   nuScenes   Supervised              84.5  22.7     66.4        63.3           59.5        72.7   96.4              73.4      74.0     85.4        87.9     71.5
semKITTI   nuScenes   Naive                   35.7  0.2      0.4         5.7            7.5         8.1    73.8              15.0      14.9     8.3         51.4     20.1
semKITTI   nuScenes   CORAL sun2016deep       51.0  0.9      6.0         4.0            25.9        29.9   82.6              27.1      27.0     55.3        56.7     33.3
semKITTI   nuScenes   MEnt vu2019advent       57.4  2.2      4.6         6.4            22.6        19.3   82.3              28.8      29.9     46.8        64.2     33.1
semKITTI   nuScenes   AEnt vu2019advent       57.4  1.1      8.6         6.7            24.0        10.1   81.0              25.4      26.6     34.2        58.9     30.4
semKITTI   nuScenes   (M+A)Ent vu2019advent   57.3  1.1      2.3         6.8            23.4        7.9    83.5              32.6      31.8     43.3        62.3     32.0
semKITTI   nuScenes   SWD lee2019sliced       45.3  2.1      2.2         3.4            25.9        10.6   80.7              26.5      30.1     43.9        60.2     30.1
semKITTI   nuScenes   Ours                    54.4  3.0      1.9         7.6            27.7        15.8   82.2              29.6      34.0     57.9        65.7     34.5
semKITTI   semKITTI   Supervised              92.2  52.6     47.8        48.3           53.7        80.2   94.6              82.5      70.6     85.9        86.8     72.3
nuScenes   semKITTI   Naive                   7.7   0.1      0.9         0.6            6.4         0.4    30.4              5.7       28.4     27.8        30.2     12.6
nuScenes   semKITTI   CORAL sun2016deep       47.3  10.4     6.9         5.1            10.8        0.7    24.8              13.8      31.7     58.8        45.5     23.2
nuScenes   semKITTI   MEnt vu2019advent       27.1  2.0      2.3         3.4            9.5         0.4    29.3              11.3      28.0     35.8        39.0     17.1
nuScenes   semKITTI   AEnt vu2019advent       42.4  4.5      6.9         2.8            6.7         0.7    16.1              7.0       26.1     46.1        42.0     18.3
nuScenes   semKITTI   (M+A)Ent vu2019advent   49.6  5.9      4.3         6.4            9.6         2.6    22.5              12.7      30.3     57.4        49.1     22.8
nuScenes   semKITTI   SWD lee2019sliced       34.2  2.7      1.5         2.0            5.3         0.9    28.8              20.5      28.3     38.2        36.7     18.1
nuScenes   semKITTI   Ours                    49.6  4.6      6.3         2.0            12.5        1.8    25.2              25.2      42.3     43.4        45.3     23.5

Table 1: Comparison (IoU% for each class and mIoU%) with state-of-the-art UDA methods for LiDAR semantic segmentation from SemanticKITTI (semKITTI) to nuScenes and from nuScenes to semKITTI. We also include the performance of the fully supervised method (Supervised) to indicate the upper bound on each dataset.

Source      Target  Method                  Car   Pedestrian  mIoU
KITTI       KITTI   Supervised              66.9  28.0        47.4
GTA-LiDAR   KITTI   Naive                   17.5  14.9        16.2
GTA-LiDAR   KITTI   CORAL sun2016deep       34.8  22.4        28.6
GTA-LiDAR   KITTI   MEnt vu2019advent       37.9  21.8        30.0
GTA-LiDAR   KITTI   AEnt vu2019advent       34.3  23.0        28.6
GTA-LiDAR   KITTI   (M+A)Ent vu2019advent   29.8  22.5        26.1
GTA-LiDAR   KITTI   SWD lee2019sliced       32.1  29.6        30.8
GTA-LiDAR   KITTI   Ours                    51.0  29.3        40.2

Table 2: Comparison (IoU% for each class and mIoU%) for UDA from the synthetic GTA-LiDAR dataset to the real KITTI dataset.
Figure 8: Sample LiDAR semantic segmentation results on target datasets, namely, SemanticKITTI (left) and KITTI (right). The top, middle, and bottom rows contain the ground truth, baseline prediction with no adaptation (i.e. naive), and the prediction from our UDA framework, respectively. We show results on 2D RV image projections with each color denoting a different semantic class.

4.3 Ablation Study

We conduct an ablation study to incrementally examine the contribution and effectiveness of the various modules in our UDA method. Table 3 shows that the mIoU improves as each module is added, demonstrating that every module contributes to the UDA task.

Method         sK→N   N→sK   GTA→K
RVIC           23.5   17.5   17.1
RVIC+UMT       32.0   22.5   39.3
RVIC+UMT+GA    34.5   23.5   40.2

Table 3: Ablation study on the different modules in our method. sK, N, GTA, and K denote the SemanticKITTI, nuScenes, GTA-LiDAR, and KITTI datasets, respectively. We report mIoU (%) in each cell. The last row corresponds to our final UDA method.

5 Conclusion

In this paper, we proposed an unsupervised domain adaptation (UDA) framework for LiDAR semantic segmentation. Our pipeline consists of a self-supervised auxiliary task coupled with an unpaired mask transfer mechanism to alleviate the impact of the domain gap when adapting a model from a source to a target domain. Additionally, we introduced light-weight gated adapter modules to inject target domain-specific evidence into the network. Extensive experiments on adapting from both real-to-real and synthetic-to-real datasets demonstrated that our approach achieves state-of-the-art results.

6 Limitations and Societal Impact

The proposed UDA framework has a few limitations. First, it is tested on a limited number of object categories that are present in publicly available datasets. Second, it cannot deal with out-of-distribution object categories (i.e., categories that are not present in the dataset). Third, we only focused on the LiDAR sensors corresponding to the datasets used in this paper; in the real world, there are many more LiDAR sensors with very different characteristics and installations.

Research presented in this paper could potentially improve the LiDAR semantic segmentation algorithms in autonomous driving systems, especially when they are exposed to different environments. However, the research needs further improvement (in terms of accuracy, robustness and safety) before it is integrated within an autonomous driving system.

References

  • [1] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [2] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [3] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
  • [4] Tiago Cortinhal, George Tzelepis, and Eren Erdal Aksoy. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds. In International Symposium on Visual Computing, 2020.
  • [5] Andres Milioto, Ignacio Vizzo, Jens Behley, and Cyrill Stachniss. Rangenet++: Fast and accurate lidar semantic segmentation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019.
  • [6] Inigo Alonso, Luis Riazuelo, Luis Montesano, and Ana C Murillo. 3d-mininet: Learning a 2d representation from point clouds for fast and efficient 3d lidar semantic segmentation. IEEE Robotics and Automation Letters, 5(4):5432–5439, 2020.
  • [7] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In IEEE/CVF International Conference on Computer Vision, 2019.
  • [8] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [9] Eduardo R Corral-Soto, Amir Nabatchian, Martin Gerdzhev, and Liu Bingbing. Lidar few-shot domain adaptation via integrated cyclegan and 3d object detector with joint learning delay. In International Conference on Robotics and Automation, 2021.
  • [10] Iñigo Alonso, L. Riazuelo, L. Montesano, and A. C. Murillo. Domain adaptation in lidar semantic segmentation. arXiv preprint arXiv:2010.12239, 2020.
  • [11] Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, 2016.
  • [12] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. Sliced wasserstein discrepancy for unsupervised domain adaptation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [13] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [14] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
  • [15] Yuxing Xie, Jiaojiao Tian, and Xiao Xiang Zhu. Linking points with labels in 3d: A review of point cloud semantic segmentation. IEEE Geoscience and Remote Sensing Magazine, 8(4):38–59, 2020.
  • [16] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [17] Bichen Wu, Alvin Wan, Xiangyu Yue, and Kurt Keutzer. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In IEEE International Conference on Robotics and Automation, 2018.
  • [18] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, 2015.
  • [19] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [20] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [21] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, 2018.
  • [22] Bichen Wu, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In International Conference on Robotics and Automation, 2019.
  • [23] Can Qin, Haoxuan You, Lichen Wang, C-C Jay Kuo, and Yun Fu. Pointdan: A multi-scale 3d domain adaption network for point cloud representation. In Advances in Neural Information Processing Systems, 2019.
  • [24] Maximilian Jaritz, Tuan-Hung Vu, Raoul de Charette, Emilie Wirbel, and Patrick Pérez. xmuda: Cross-modal unsupervised domain adaptation for 3d semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [25] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, 2017.
  • [26] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Efficient parametrization of multi-domain deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [28] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [29] Sicheng Zhao, Yezhen Wang, Bo Li, Bichen Wu, Yang Gao, Pengfei Xu, Trevor Darrell, and Kurt Keutzer. epointda: An end-to-end simulation-to-real domain adaptation framework for lidar point cloud segmentation. arXiv preprint arXiv:2009.03456, 2020.