A Learnable Self-supervised Task for Unsupervised Domain Adaptation on Point Clouds

04/12/2021 · by Xiaoyuan Luo, et al. · Fudan University

Deep neural networks have achieved promising performance in supervised point cloud applications, but manual annotation is extremely expensive and time-consuming in supervised learning schemes. Unsupervised domain adaptation (UDA) addresses this problem by training a model with only labeled data in the source domain but making the model generalize well in the target domain. Existing studies show that self-supervised learning using both source and target domain data can help improve the adaptability of trained models, but they all rely on hand-crafted designs of the self-supervised tasks. In this paper, we propose a learnable self-supervised task and integrate it into a self-supervision-based point cloud UDA architecture. Specifically, we propose a learnable nonlinear transformation that transforms a part of a point cloud to generate abundant and complicated point clouds while retaining the original semantic information, and the proposed self-supervised task is to reconstruct the original point cloud from the transformed ones. In the UDA architecture, an encoder is shared between the networks for the self-supervised task and the main task of point cloud classification or segmentation, so that the encoder can be trained to extract features suitable for both the source and the target domain data. Experiments on PointDA-10 and PointSegDA datasets show that the proposed method achieves new state-of-the-art performance on both classification and segmentation tasks of point cloud UDA. Code will be made publicly available.

1 Introduction

Point clouds contain abundant spatial geometric information and have become an indispensable 3D data representation in computer vision. With the continuous advancement of range sensors, such as depth cameras and LiDAR, more and more 3D point clouds are captured and processed in many different applications, including automated driving [19], human-computer interaction [13] and augmented reality [40]. Recently, deep neural networks have exhibited excellent performance in supervised tasks on point clouds, such as classification [23, 10], segmentation [18, 37, 31] and registration [3, 33, 8]. In the supervised learning scheme, manual labeling of a large number of point clouds is needed for model training. However, manual labeling is often extremely expensive and time-consuming, especially when point clouds need to be labeled for segmentation. Therefore, increasing attention has been attracted to techniques that help alleviate the dependence on labeled data.

Figure 1:

Destruction-reconstruction self-supervised task on point clouds. The original point cloud is first destroyed in some manner, and then a deep neural network is trained to reconstruct the original point cloud from the destroyed ones. The red points indicate the part of the point cloud that is destroyed. (a) Hand-crafted destruction: replace a part of the point cloud with new points sampled from a Gaussian distribution. (b) Proposed learnable destruction: local nonlinear transformation by a deep neural network.

Unsupervised domain adaptation (UDA) is one way to address the problem of label-scarce datasets: it aims to leverage labeled source domain data and unlabeled target domain data to train a model that works in the target domain. Because of the inevitable distribution discrepancy between the source domain and the target domain, a model trained only in the source domain usually does not work well in the target domain. In the 2D image field, many widely used domain adaptation (DA) methods have been proposed [6], where domain adaptation is usually achieved by globally matching the feature distributions of the source and the target domains [17]. However, in the field of 3D point cloud processing, the feature distributions learned from different domains are quite different [2], and DA methods designed for 2D images, such as MMD [4] and DANN [9], cannot be transferred directly to point clouds. Recently, a number of studies have emerged for point cloud UDA by learning domain-invariant features. As mentioned in [25], local features of point clouds are more transferable than global features, so the authors proposed using adaptive nodes to align the local features between the source and the target domains. However, the experimental results on the point cloud domain adaptation benchmark PointDA-10 [25] were not satisfactory. In contrast, [2] and [1] proposed utilizing self-supervised tasks to help capture highly transferable feature representations, and they achieved higher classification accuracy than [25] on PointDA-10, thereby demonstrating the effectiveness of integrating self-supervised tasks in point cloud UDA.

In this paper, we propose a novel learnable self-supervised task and integrate it into a self-supervision-based point cloud UDA architecture [1, 38, 28] to achieve improved performance. The UDA architecture contains a main network for the main supervised task and an auxiliary network for the auxiliary self-supervised task. The two networks share an encoder but have different heads to perform the two tasks separately. The labeled data in the source domain are used to train the main network, and the data in both the source domain and the target domain are used together to train the auxiliary network, where the pseudo labels used for training are generated automatically. Therefore, the encoder is trained to extract features from data in both the source domain and the target domain so that it can be effectively transferred to the target domain. In this architecture, how well the encoder can be transferred to the target domain heavily depends on the self-supervised task. For this reason, we propose a learnable nonlinear transformation on point clouds and use it to design a destruction-reconstruction self-supervised task, as shown in Figure 1. Destruction-reconstruction is a commonly used strategy for designing self-supervised tasks, but all existing studies have explored hand-crafted methods to destroy the point cloud, such as the one shown in Figure 1 (a) [2]. In contrast, as shown in Figure 1 (b), we propose a learnable transformation based on a deep neural network, which destroys a point cloud by applying nonlinear transformation on a part of it. Through adversarial training, the network is able to learn a continuous nonlinear transformation to generate highly abundant and complicated point clouds while retaining the semantic information. By reconstructing the original point cloud from these locally transformed ones, the encoder can learn to extract local features for both the source domain and the target domain data.

Our main contributions are summarized as follows:

  • We propose a novel learnable transformation on point clouds and construct a learnable point cloud destruction-reconstruction self-supervised task. Compared with existing hand-crafted self-supervised tasks, our method can learn more transferable cross-domain features to mitigate the distribution discrepancy. To the best of our knowledge, this is the first learnable self-supervised task in both point cloud processing and the broader computer vision field.

  • We apply the proposed learnable self-supervised task to point cloud UDA and develop a multi-region destruction strategy. Destroying point clouds in different regions and then reconstructing the original point clouds encourages the encoder in the UDA architecture to focus on local features, which is beneficial to domain adaptation.

  • The proposed method is evaluated on PointDA-10 and PointSegDA datasets for point cloud classification and segmentation UDA, respectively, and it achieves new state-of-the-art performance on both tasks.

2 Related Works

2.1 Deep Learning on Point Clouds

With the unprecedented success achieved by deep neural networks in 2D visual tasks, many deep learning techniques have been proposed for 3D point clouds. A major difference between 3D point clouds and 2D images is that point clouds are unordered sets, so networks for point clouds should be permutation invariant. A series of architectures have been proposed to process 3D point clouds, such as PointNet [23], PointNet++ [24], DGCNN [34], etc. Interested readers are referred to a recent survey [28] for details.

2.2 Domain Adaptation on Point Clouds

Domain adaptation is a branch of transfer learning and has long been an important issue in machine learning [22]. In UDA, the source domain and the target domain tasks are the same, but the data distributions of the two domains are different [20]. Meanwhile, we only have labels for the source domain data. The objective of UDA is to train a model that can be used directly on target domain data.

With the rapid development of deep learning techniques in computer vision, deep learning based DA was first developed for 2D images [32] and subsequently for 3D point clouds. Some studies [26, 35] applied 2D image DA techniques to 3D point clouds, losing large amounts of 3D geometric information, which is crucial for understanding 3D shapes. In addition, most general-purpose DA methods for 2D images merely perform global feature alignment without utilizing local geometric information, impeding their application to 3D point clouds. Two types of approaches have been proposed to utilize local information in point cloud DA. One type aligns the local and global features at the same time, and the other type utilizes the powerful feature extraction capability of self-supervised learning to capture more effective local information. PointDAN [25] belongs to the first type: it jointly aligns the global and local features of the source and target domains at multiple scales, but its performance on PointDA-10 is limited. In the second type, self-supervised tasks are applied to capture abundant local and global features. The effectiveness of self-supervised learning in promoting DA has been demonstrated on 2D images [38, 28], and it has also been used for point cloud UDA. As the first attempt in this direction, [1] introduced a destruction-reconstruction self-supervised auxiliary task for point cloud UDA, where the destruction replaces several randomly selected regions with new points sampled from an isotropic Gaussian distribution. [2] utilizes a similar UDA architecture to [1] but proposes a new self-supervised 3D puzzle auxiliary task for UDA. These studies show the effectiveness of integrating self-supervised tasks in point cloud UDA. In this work, we follow this line of research and propose a new self-supervised task for better domain adaptation.

Figure 2:

The proposed point cloud UDA framework. (a) The framework of the domain adaptation network based on self-supervised learning, where the red and blue arrows represent data flows from the source and target domains, respectively. (b) A learnable point cloud transformation network. (c) A self-supervised point cloud reconstruction task based on multi-region transformation.

2.3 Self-Supervised Learning on Point Clouds

Self-supervised learning utilizes pseudo labels generated from the data itself to train a learning model. It has long been a research focus in the field of machine learning, and readers are referred to [15] for a comprehensive review. Recently, there has been a series of work on point cloud self-supervised learning, most of which focuses on proposing new self-supervised auxiliary tasks to improve performance on main tasks. Several strategies are commonly used to design self-supervised tasks. The first strategy is to train a neural network to learn some features that can also be directly calculated from the point clouds. For example, [11] introduced a half-to-half point cloud prediction self-supervised task in a multi-angle scenario, where an RNN was used to predict the back half of a point cloud from the front half. In [29], a novel geometric self-supervised learning task was proposed to predict local geometric information of point clouds, such as normals and curvature. The second strategy is to first destroy a point cloud and then train a neural network to reconstruct it. Both [27] and [2] proposed self-supervised tasks agnostic to the network architecture, in which a network is trained to reconstruct point clouds whose parts have been randomly rearranged. [1] suggested simulating noisy data and occlusion in real-world applications to construct the self-supervised reconstruction task, which replaces randomly selected regions of an original point cloud with new points sampled from a Gaussian distribution and then reconstructs the original point cloud. The third strategy is to design self-supervised tasks based on contrastive learning. For example, [41] cut each point cloud into two parts and constructed a contrastive task by predicting whether the two parts belong to the same object. [14] leveraged different modalities of the same 3D objects, including images, point clouds, meshes and images from different views, and trained a model to differentiate positive pairs constructed from the same object from negative pairs constructed from different objects, so that the model learns modal- and view-invariant features. There are other techniques to design self-supervised tasks. For example, [30] transformed unstructured point clouds into sequences using the space-filling Morton-order curve and proposed a self-supervised task in which a multi-layer RNN was employed to predict the next point in the sequence. [12] generated pseudo labels through clustering in feature space and combined a self-supervised classification task with unsupervised clustering and reconstruction tasks to train a multi-scale encoder for point cloud and shape feature extraction. Nevertheless, in all these studies, the self-supervised tasks are hand-crafted, while in this paper, we propose the first learnable self-supervised task and show superior performance in point cloud UDA.

3 Method

In this section, we first present the problem formulation of point cloud UDA and the UDA framework based on a self-supervised auxiliary task. Then, we explain the proposed point cloud destruction-reconstruction self-supervised task in detail, where point clouds are destroyed by a learnable nonlinear transformation. Finally, we propose a multi-region destruction strategy to help the encoder extract more transferable features. Please note that the problem formulation and the method are presented with point cloud classification as the main task, and the idea can be easily extended to point cloud segmentation.

3.1 Problem Formulation

In UDA for point cloud classification, we have a source domain $\mathcal{S}=\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ with labeled point clouds and a target domain $\mathcal{T}=\{x_j^t\}_{j=1}^{n_t}$ with unlabeled point clouds. Each point cloud $x_i^s$ in the source domain has a label $y_i^s$. The classes of the target domain are the same as those of the source domain, but the target domain samples $x_j^t$ are unlabeled. The objective is to train a classification network that generalizes well for classifying the target domain point clouds.

3.2 UDA Framework Based on Self-Supervised Auxiliary Task

Figure 3:

Point cloud transformation network. The blocks with different colors represent point cloud features, the dashed lines represent skip connections, and the values in the shared MLP module represent the numbers of neurons. Note that the batch normalization layer is applied after each layer of shared MLP, and ReLU is used as the activation function.

The architecture of utilizing self-supervised tasks for UDA in [2, 1] can be summarized as the framework shown in Figure 2 (a), which is also used in this study. The framework consists of a network $\Phi_{\mathrm{main}}$ for the main task of point cloud classification and an auxiliary network $\Phi_{\mathrm{aux}}$ for the self-supervised auxiliary task, where $\Phi_{\mathrm{main}}$ and $\Phi_{\mathrm{aux}}$ have their own heads $h_{\mathrm{main}}$ and $h_{\mathrm{aux}}$, respectively, but share the same encoder $E$. The network $\Phi_{\mathrm{main}}$ is trained by the main task loss $\mathcal{L}_{\mathrm{main}}$, which is defined as the cross-entropy loss between the predicted labels and the corresponding ground truth labels on the source domain data. For the auxiliary network, both the source domain data and the target domain data are utilized for self-supervised learning. First, the original point clouds $x^s$ and $x^t$ are processed by a function $g$ to obtain $\tilde{x}^s = g(x^s)$ and $\tilde{x}^t = g(x^t)$, and their pseudo labels $\tilde{y}^s$ and $\tilde{y}^t$ for the auxiliary self-supervised task are generated automatically from the original data $x^s$ and $x^t$. Then, the auxiliary network is trained in a supervised manner with $\tilde{x}^s$ and $\tilde{x}^t$ as inputs and $\tilde{y}^s$ and $\tilde{y}^t$ as the corresponding labels. The loss to train the auxiliary network is denoted as $\mathcal{L}_{\mathrm{aux}}$. Hence, the final optimization objective is:

$\min_{E,\,h_{\mathrm{main}},\,h_{\mathrm{aux}}}\ \mathcal{L}_{\mathrm{main}}\big(h_{\mathrm{main}}(E(x^s)),\, y^s\big) + \mathcal{L}_{\mathrm{aux}}\big(h_{\mathrm{aux}}(E(\tilde{x}^s)),\, \tilde{y}^s\big) + \mathcal{L}_{\mathrm{aux}}\big(h_{\mathrm{aux}}(E(\tilde{x}^t)),\, \tilde{y}^t\big)$   (1)

where the first term is the loss of the main task on the source domain data, and the second and the third terms are the loss of the self-supervised auxiliary task on the source domain data and the target domain data, respectively.
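To make the training objective concrete, the following PyTorch-style sketch shows one optimization step of the framework in Figure 2 (a). The module and function names (encoder, main_head, aux_head, transform_fn, aux_loss_fn) are illustrative assumptions, not our released implementation.

```python
import torch.nn.functional as F

def uda_training_step(encoder, main_head, aux_head, transform_fn, aux_loss_fn,
                      x_src, y_src, x_tgt):
    """One optimization step of the self-supervision-based UDA framework, Eq. (1)."""
    # Main task: supervised classification on labeled source-domain data only.
    loss_main = F.cross_entropy(main_head(encoder(x_src)), y_src)

    # Auxiliary task: self-supervised learning on BOTH domains. transform_fn plays
    # the role of the function g: it destroys the input and returns the pseudo label
    # generated from the original data (here, the original point cloud to reconstruct).
    x_src_aux, pseudo_src = transform_fn(x_src)
    x_tgt_aux, pseudo_tgt = transform_fn(x_tgt)
    loss_aux_src = aux_loss_fn(aux_head(encoder(x_src_aux)), pseudo_src)
    loss_aux_tgt = aux_loss_fn(aux_head(encoder(x_tgt_aux)), pseudo_tgt)

    # Eq. (1): main-task loss plus the auxiliary losses on source and target data.
    return loss_main + loss_aux_src + loss_aux_tgt
```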

In the above architecture, the purpose of performing the auxiliary task is to learn transferable features by the encoder for the target domain, and this is achieved by training the encoder to extract features from both the source domain and the target domain data. In addition, local features tend to be more transferable across different point cloud datasets than global features. For instance, the overall shape of an aircraft varies greatly across datasets, but the differences between wings are relatively small, so the designed self-supervised task should encourage the encoder to focus on local features [25]. In [1], Achituve et al. proposed transforming a portion of the point clouds, forcing the encoder to extract local features from the remaining component of the point cloud to perform the point cloud reconstruction task. Similarly, the 3D puzzle problem constructed in [2] also enables the encoder to pay attention to local features. Following this idea, we propose a learnable self-supervised task to make the encoder focus more on local regions than on the whole point cloud.

3.3 Self-Supervised Task Using A Learnable Transformation

Many self-supervised tasks on point clouds rely on hand-crafted transformations of the point clouds, and different transformations lead to different features being learned by the encoder. Transformations explored in the literature include adding noise [1], removing a part of the point cloud [1], shuffling the sub-blocks of a point cloud [2], and so on. However, hand-crafted transformations are limited and inflexible, which may restrict the effectiveness of the trained encoder. In this section, we propose a learnable transformation based on deep neural networks, as shown in Figure 2 (b).

We construct a transformation network $T$ parameterized by $\theta$ to transform the original point cloud $x$ into $\hat{x}$. The auxiliary network is applied to reconstruct the original point cloud from $\hat{x}$, and the reconstructed point cloud is denoted as $\bar{x}$:

$\hat{x} = T_\theta(x)$   (2)
$\bar{x} = h_{\mathrm{aux}}\big(E(\hat{x})\big)$   (3)

The objective of the transformation network is to maximize the chamfer distance between the original point cloud and the transformed point cloud, while the objective of the auxiliary reconstruction network, which consists of the shared encoder $E$ and the head $h_{\mathrm{aux}}$, is to minimize the chamfer distance between the reconstructed point cloud and the original point cloud. So, the optimization objective of the auxiliary self-supervised task becomes:

$\min_{E,\,h_{\mathrm{aux}}}\ \max_{\theta}\ \Big[\, \alpha\, d_{CD}\big(x,\, T_\theta(x)\big) + \beta\, d_{CD}\big(x,\, h_{\mathrm{aux}}(E(T_\theta(x)))\big) \Big]$   (4)

where $\alpha$ and $\beta$ are hyper-parameters, and $d_{CD}(\cdot,\cdot)$ represents the chamfer distance between two point clouds. Training the transformation network and the reconstruction network in such an end-to-end way enables $T_\theta$ to learn a nonlinear and continuous transformation. Meanwhile, the encoder also learns more robust features to reconstruct the original point cloud from the transformed one, thereby improving the transferability of the learned features and the domain adaptability of the encoder.
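As an illustration of Eq. (4), the sketch below computes the two adversarial loss terms with a symmetric Chamfer distance. The tensor layout (batch, points, 3) and the split into two losses, applied to the transformation parameters and to the encoder/head respectively, are assumptions made for clarity rather than our exact implementation.

```python
import torch

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a, b of shape (B, N, 3)."""
    d = torch.cdist(a, b)                                    # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)

def adversarial_aux_losses(x, transform_net, encoder, aux_head, alpha=1.0, beta=10.0):
    x_t = transform_net(x)                                   # transformed (destroyed) cloud
    x_rec = aux_head(encoder(x_t))                           # reconstructed cloud

    # Transformation network (ascent on theta): push the transformed cloud away
    # from the original, i.e. maximize alpha * d_CD(x, x_t).
    loss_transform = -alpha * chamfer_distance(x, x_t).mean()
    # Encoder and reconstruction head (descent): pull the reconstruction back.
    loss_reconstruct = beta * chamfer_distance(x, x_rec).mean()
    return loss_transform, loss_reconstruct
```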

Figure 4: Examples of point clouds in the PointDA-10 dataset.

The transformation network is composed of multiple MLP layers with shared parameters, as shown in Figure 3. Sharing parameters across all points ensures that neighboring points in the input point cloud get similar outputs after transformation, so that the learned transformation is continuous and the global semantic information of the point cloud is maintained. Instead of directly outputting the coordinates of the transformed points, we output the displacement of each point through the point cloud transformation network and control the scale of the displacement by a shift-scale hyperparameter $s$ as follows:

$\hat{x} = x + s \cdot D_\theta(x)$   (5)

where $D_\theta(x)$ denotes the per-point displacements predicted by the transformation network.
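A minimal PyTorch sketch of the transformation network is shown below. The layer widths are placeholders (the actual numbers of neurons are given in Figure 3), and the skip connections in Figure 3 are omitted for brevity.

```python
import torch.nn as nn

class TransformNet(nn.Module):
    """Point-wise shared MLP that predicts a scaled displacement for every point, Eq. (5)."""
    def __init__(self, shift_scale=0.05):
        super().__init__()
        self.shift_scale = shift_scale
        # A "shared MLP" is a stack of 1x1 convolutions applied identically to every
        # point, each followed by batch normalization and ReLU as described above.
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 3, 1),
        )

    def forward(self, x):                                    # x: (B, N, 3)
        disp = self.mlp(x.transpose(1, 2)).transpose(1, 2)   # per-point displacement D(x)
        return x + self.shift_scale * disp                   # Eq. (5)
```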

3.4 Multi-Region Transformation

Experiments show that transforming the whole point cloud will make it very difficult to reconstruct the original point cloud, which may be due to the loss of most local and contextual information during transformation. To avoid this issue, we introduce a multi-region local point cloud transformation strategy based on our proposed learnable transformation network.

As shown in Figure 2 (c), we employ a local transformation for the point cloud, so that the encoder can learn features from both the transformed and the untransformed regions. For one point cloud $x$, multiple transformation networks with shared parameters are used to transform parts of the point cloud at different random locations, obtaining several different transformed point clouds of the same object; the same reconstruction network is then applied to reconstruct the original point cloud from the transformed ones. This strategy encourages the encoder to extract features from different parts of the objects for the reconstruction task.
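One possible implementation of this strategy is sketched below: for each of several random anchor points, the k nearest neighbors of the anchor (k = 512 in Section 4.1) are passed through the shared transformation network while the rest of the point cloud is left untouched. The anchor sampling and gather/scatter bookkeeping are illustrative choices.

```python
import torch

def multi_region_transform(x, transform_net, num_regions=2, k=512):
    """x: (B, N, 3) -> list of num_regions locally transformed point clouds."""
    B, N, _ = x.shape
    outputs = []
    for _ in range(num_regions):
        anchor_idx = torch.randint(0, N, (B,), device=x.device)                 # one random anchor per cloud
        anchor = x[torch.arange(B, device=x.device), anchor_idx].unsqueeze(1)   # (B, 1, 3)
        dist = torch.cdist(anchor, x).squeeze(1)                                # (B, N) distances to anchor
        knn_idx = dist.topk(k, largest=False).indices                           # indices of the local region
        idx3 = knn_idx.unsqueeze(-1).expand(-1, -1, 3)
        region = torch.gather(x, 1, idx3)                                       # (B, k, 3) local region
        transformed = transform_net(region)                                     # transform only this region
        x_out = x.clone()
        x_out.scatter_(1, idx3, transformed)                                    # put transformed points back
        outputs.append(x_out)
    return outputs
```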

Method  ModelNet→ShapeNet  ModelNet→ScanNet  ShapeNet→ModelNet  ShapeNet→ScanNet  ScanNet→ModelNet  ScanNet→ShapeNet  Avg
w/o Adapt 80.2 43.1 75.8 40.7 63.2 67.2 61.7
Rotate 81.6 48.2 64.6 49.0 48.0 63.0 59.1
PointDAN 80.2 45.3 71.2 46.9 59.8 66.2 61.6
DefRec 80.0 46.0 68.5 41.7 63.0 68.2 61.2
DefRec+PCM 81.1 50.3 54.3 52.8 54.0 69.0 60.3
Resort 81.6 49.7 73.6 41.9 65.9 68.1 63.5
Ours 82.5 52.7 73.8 53.8 67.4 69.0 66.5
Table 1: The classification accuracy (%) on the PointDA-10 dataset with PointNet as the encoder.
Method  ModelNet→ShapeNet  ModelNet→ScanNet  ShapeNet→ModelNet  ShapeNet→ScanNet  ScanNet→ModelNet  ScanNet→ShapeNet  Avg
w/o Adapt 81.7 42.9 72.2 44.2 67.3 65.1 62.2
Rotate 83.0 51.6 72.5 41.0 67.1 70.3 64.3
PointDAN 83.9 44.8 63.3 45.7 43.6 56.4 56.3
RS 81.5 35.2 71.9 39.8 61.0 63.6 58.8
DefRec 83.3 46.6 79.8 49.9 70.7 64.4 65.8
DefRec+PCM 81.7 51.8 78.6 54.5 73.7 71.1 68.6
Ours 82.8 56.3 81.7 54.8 72.9 71.7 70.0
Table 2: The classification accuracy (%) on the PointDA-10 dataset with DGCNN as the encoder.

4 Experiments

4.1 Point Cloud Classification UDA

Dataset. We evaluated our method on the PointDA-10 dataset [25] specifically designed for point cloud DA; this dataset contains 10 shared classes from three widely used point cloud datasets: ModelNet [36], ShapeNet [5] and ScanNet [7], as shown in Figure 4. The point clouds in ModelNet and ShapeNet were sampled from 3D CAD models, while the point clouds in ScanNet were sampled from scanned and reconstructed real-world indoor scenes. As in [1], we rotated the ScanNet and ShapeNet models so that, in all three datasets, the upward direction of each model corresponds to the positive direction of the same axis. There are obvious domain gaps between the three datasets, as shown in Figure 4. The point clouds in both ModelNet and ShapeNet are complete surface points of 3D objects, but the objects in the two sets have different shapes and styles, while the point clouds in ScanNet are partial surface points with noise. UDA experiments were performed by choosing one dataset as the source domain and another as the target domain, resulting in six UDA scenarios, where performance was measured in terms of classification accuracy on the target domain. We split the official training set into 80% for training and 20% for validation, and used the official test sets for testing.

Implementation Details. Our model was trained on a single NVIDIA V100 GPU using the deep learning library PyTorch [21]. The Adam [16] optimizer with a learning rate of 0.001 and a weight decay of 0.0005 was applied under a cosine annealing learning rate scheduler. An early-stopping mechanism based on the validation dataset was applied to avoid over-fitting. The batch size and the number of epochs were set to 16 and 150, respectively. During training, we applied random rotations about the up axis and random jittering with standard deviation and clip parameters of 0.01 and 0.02, respectively, for data augmentation. The model with the highest classification accuracy on the source domain validation dataset was preserved to evaluate the performance on the target domain test dataset.
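For reference, the sketch below shows one way to implement this augmentation; treating the up axis as the z axis and drawing one rotation angle per point cloud are assumptions.

```python
import math
import torch

def augment(x, sigma=0.01, clip=0.02):
    """x: (B, N, 3) point clouds -> randomly rotated and jittered point clouds."""
    # Random rotation about the up (z) axis, one angle per point cloud.
    theta = torch.rand(x.shape[0], device=x.device) * 2 * math.pi
    c, s = torch.cos(theta), torch.sin(theta)
    rot = torch.zeros(x.shape[0], 3, 3, device=x.device)
    rot[:, 0, 0], rot[:, 0, 1] = c, -s
    rot[:, 1, 0], rot[:, 1, 1] = s, c
    rot[:, 2, 2] = 1.0
    x = torch.bmm(x, rot.transpose(1, 2))
    # Random jitter: Gaussian noise with std 0.01, clipped to [-0.02, 0.02].
    noise = torch.clamp(sigma * torch.randn_like(x), -clip, clip)
    return x + noise
```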

Network Options. Any point cloud encoding network can be used as the encoder in our method, and we experimented with two networks: PointNet and DGCNN. The main task head network and the point cloud reconstruction head network are both fully connected networks. In the multi-region point cloud transformation strategy, we constructed two transformation networks with shared parameters to transform the 512 nearest neighbors of a randomly selected point, with a shift scale $s$ of 0.05. Since the point cloud transformation network is easier to train than the reconstruction network, we set $\alpha$ and $\beta$ to 1 and 10, respectively.

Random Cropping. ModelNet and ShapeNet contain point clouds of complete surfaces of objects, while ScanNet contains scanned and reconstructed real-world point clouds, which are often incomplete surfaces of objects. Due to the shape differences between them, we applied a random cropping strategy as in [39] for data augmentation when ScanNet was chosen as the target domain, where the point clouds of the source domain were randomly cropped by a plane with a random direction, and 70% of the point cloud was retained.
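The cropping step can be sketched as follows, assuming the crop keeps the 70% of points lying furthest along a randomly drawn direction, i.e. on one side of a plane with random orientation.

```python
import torch

def random_plane_crop(x, keep_ratio=0.7):
    """x: (N, 3) point cloud -> the keep_ratio fraction of points on one side of a random plane."""
    direction = torch.randn(3)
    direction = direction / direction.norm()          # random plane normal
    projections = x @ direction                       # signed distance along the normal
    keep = int(keep_ratio * x.shape[0])
    idx = projections.topk(keep).indices              # retain the points furthest along it
    return x[idx]
```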

Classification Results. We compared our proposed method with other competitive point cloud UDA methods, including PointDAN [25], DefRec [1], Resort [2] and RS [27]. In addition, we compared with two other baseline methods. The first baseline trains the main task network with the source domain data and then directly uses it on the target domain without adaptation. The second baseline utilizes the same architecture as our proposed method but uses rotation prediction as the self-supervised task, and it is denoted as Rotate. Since these UDA methods are encoder-independent, we compared the results with either PointNet or DGCNN as the encoder to evaluate the generality of the compared UDA methods. The results with PointNet and DGCNN as the encoder are shown in Table 1 and Table 2, respectively. We reproduced PointDAN's results. Because the code of Resort [2] is not publicly available, we only use the results reported in the original paper with PointNet as the encoder. The results of DefRec and RS are from [1], and PCM refers to the point cloud mixup data augmentation extension adopted in [1].

Our proposed method achieves the highest average classification accuracy with both encoders. In Table 1, with PointNet as the encoder, our method achieves the highest accuracy on five out of six UDA scenarios, and w/o Adapt achieves the highest accuracy on the remaining scenario. In Table 2, with DGCNN as the encoder, our method achieves the highest classification accuracy on four out of six UDA scenarios, while DefRec+PCM and PointDAN each achieve the highest accuracy on one scenario. With PointNet and DGCNN as the encoder, the average accuracy of our method is 4.8% and 7.8% higher than the baseline without adaptation, respectively, which indicates the effectiveness of the proposed adaptation method. By comparing the results of the proposed method and Rotate, we can see that the proposed destruction-reconstruction self-supervised task is more effective than the rotation prediction task. By comparing Table 1 and Table 2, we can also see that our proposed method is less sensitive to the choice of encoder.

SSL  Multi  Crop  DGCNN  ModelNet→ShapeNet  ModelNet→ScanNet  ShapeNet→ModelNet  ShapeNet→ScanNet  ScanNet→ModelNet  ScanNet→ShapeNet  Avg
80.2 43.1 75.8 40.7 63.2 67.2 61.7
82.6 48.8 74.0 44.7 63.1 67.9 63.5
82.5 47.8 73.8 46.2 67.4 69.0 64.5
82.5 52.7 73.8 53.8 67.4 69.0 66.5
82.8 46.7 81.7 51.3 72.9 71.7 67.9
80.6 52.2 72.1 50.5 63.1 68.4 64.5
82.8 56.3 81.7 54.8 72.9 71.7 70.0
Table 3: Ablation study on the PointDA-10 classification task (accuracy, %).

4.2 Point Cloud Segmentation UDA

Dataset and Implementation. We then evaluated our method on the PointSegDA dataset proposed in [1] for point cloud segmentation UDA. The dataset contains human body point clouds collected from four datasets: ADOBE, FAUST, MIT and SCAPE. The point clouds in these datasets have different body poses and shapes, but they are all segmented into the same eight parts. UDA experiments were performed by choosing one dataset as the source domain and another as the target domain, resulting in 12 UDA scenarios. The performance was measured in terms of mean Intersection over Union (IoU) on the target domain. We split the training dataset into 80% for training and 20% for validation, and used the official test sets for testing. DGCNN was used as the encoder in this experiment, and the segmentation head was implemented using four 1D convolutional layers with sizes [256, 256, 128, 8].
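A sketch of such a segmentation head is given below; the per-point input feature dimension (1024) and the use of batch normalization and ReLU between layers are assumptions.

```python
import torch.nn as nn

# Four 1D convolutional layers of widths [256, 256, 128, 8] applied to
# per-point features produced by the shared DGCNN encoder.
seg_head = nn.Sequential(
    nn.Conv1d(1024, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
    nn.Conv1d(256, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
    nn.Conv1d(256, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
    nn.Conv1d(128, 8, 1),   # one logit per point for each of the eight body parts
)
```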

Segmentation Results. We compare our method with the baseline without adaptation, RS [27] and DefRec [1]. The results are shown in Table 4. Our method achieved the highest average IoU and the highest IoU in eight out of 12 UDA scenarios. Compared with the baseline without adaptation, significant improvement is achieved in most scenarios after employing our domain adaptation method.

Method  FAUST→MIT  FAUST→ADOBE  FAUST→SCAPE  MIT→FAUST  MIT→ADOBE  MIT→SCAPE  ADOBE→FAUST  ADOBE→MIT  ADOBE→SCAPE  SCAPE→FAUST  SCAPE→MIT  SCAPE→ADOBE  Avg
w/o Adapt 60.9 78.5 66.5 33.6 26.6 69.9 38.5 31.2 30.0 64.5 74.1 68.4 53.6
RS 60.7 78.7 66.9 38.4 59.6 70.4 44.0 30.4 36.6 65.3 70.7 73.0 57.9
DefRec 61.8 79.7 67.4 40.1 67.1 72.6 42.5 28.9 32.2 66.2 66.4 72.2 58.1
DefRec+PCM 60.9 78.8 63.6 48.6 48.1 70.1 46.9 33.2 37.6 62.6 66.3 66.5 56.9
Ours 61.8 80.3 68.5 56.6 60.8 67.8 52.3 38.6 41.0 66.6 67.4 68.0 60.8
Table 4: The mean IoU on PointSegDA dataset with DGCNN as the encoder.

4.3 Ablation Study

To verify the effects of different modules proposed in our method, we conduct a detailed ablation study, and the quantitative results are summarized in Table 3. SSL represents applying a self-supervised task to train the network, Multi represents applying the multi-region transformation strategy, Crop represents applying the random cropping data augmentation process when the target domain is ScanNet, and DGCNN represents using DGCNN instead of PointNet as the encoder. The first row shows the results obtained by the baseline model without adaptation.

From Table 3, we can see that even when applying the proposed self-supervised task alone, the average classification accuracy reaches 63.5%, which is higher than those of the baseline and PointDAN [25]. In this setting, PointNet is the encoder, and the classification accuracy achieved using the proposed self-supervised task is higher than that obtained using the deformation-based self-supervised task in DefRec [1] and on par with the results of Resort [2]. This result shows that our self-supervised task handles the domain distribution alignment more effectively than the other methods, thus achieving better domain adaptation. After adopting the multi-region transformation strategy, the accuracy is further improved to 64.5%. In the case with ScanNet as the target domain, the random cropping augmentation for the source domain data reduces the domain gap between the source domain and the target domain, improving the accuracy of ModelNet to ScanNet and ShapeNet to ScanNet by 4.9% and 7.6%, respectively. In this way, the obtained average accuracy (66.5%) becomes the highest among all competitive methods with PointNet as the encoder. Finally, when DGCNN is applied as the encoder, our method achieves the state-of-the-art average accuracy of 70.0%.

In addition, we also experimented with using Crop for the DefRec+PCM method, but its performance changes little. Concretely, the classification accuracy for the ModelNet to ScanNet scenario decreased from 55.5% to 54.4%, and the classification accuracy for the ShapeNet to ScanNet scenario increased from 53.4% to 53.7%.

4.4 Learned Transformation as Data Augmentation

In this work, we propose a learnable self-supervised task, in which a transformation is learned to non-linearly transform a part of a point cloud. Besides being used in the domain adaptation framework, the learned transformation can also be used as a data augmentation strategy. In this section, we conduct the following experiment to study whether it is an effective data augmentation strategy and how it performs compared with the domain adaptation strategy. In this experiment, PointNet is used as the encoder.

For each of the six domain adaptation scenarios, we perform the following experiment. First, we learn a non-linear transformation by applying the destruction-reconstruction self-supervised task on the source domain data. Then, we train a classification network on the source domain data, in which the learned transformation is used as data augmentation. Finally, the learned classification network is used directly to classify the target domain point clouds. We also experiment with using the transformation in [2] as augmentation. The results are shown in Table 5. In most scenarios, our data augmentation method achieves equal or higher accuracy than [2] or than training without augmentation. This indicates that the learned transformation does improve the generalization ability of the trained model when it is used as data augmentation. However, comparing with the results in Table 1, we can see that the domain adaptation strategy is better than the data augmentation strategy (average accuracy 66.5% vs. 63.0%).

Method  ModelNet→ShapeNet  ModelNet→ScanNet  ShapeNet→ModelNet  ShapeNet→ScanNet  ScanNet→ModelNet  ScanNet→ShapeNet  Avg
w/o Adapt 80.2 43.1 75.8 40.7 63.2 67.2 61.7
DefRec 81.0 44.9 73.0 42.2 61.7 68.8 61.9
Ours 81.2 43.1 75.1 43.2 66.7 68.8 63.0
Table 5: The classification accuracy (%) on the PointDA-10 dataset when the learned transformation is used only as data augmentation (PointNet as the encoder).
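The augmentation-only pipeline of this section can be sketched as follows, reusing the hypothetical TransformNet and multi_region_transform helpers from the earlier sketches; freezing the learned transformation in the second stage and duplicating the labels for the augmented copies are assumptions about the training details.

```python
import torch
import torch.nn.functional as F

def train_with_learned_augmentation(encoder, main_head, transform_net, src_loader, optimizer):
    """Stage 2: supervised training on source data, with the frozen learned transformation as augmentation."""
    transform_net.eval()                              # learned in stage 1, then frozen
    for x, y in src_loader:                           # labeled source-domain batches only
        with torch.no_grad():
            x_aug = multi_region_transform(x, transform_net, num_regions=1)[0]
        logits = main_head(encoder(torch.cat([x, x_aug], dim=0)))
        loss = F.cross_entropy(logits, torch.cat([y, y], dim=0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```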

5 Conclusion

To address the UDA problem for point clouds, we propose a novel learnable self-supervised task that helps the adapted neural network extract transferable features. Specifically, we propose a learnable point cloud transformation and use it in a point cloud destruction-reconstruction self-supervised auxiliary task, and we apply it in a UDA framework with a multitask learning architecture. We train the main task network and the auxiliary task network, which share an encoder, so that the encoder extracts features that are highly transferable to the target domain. We further propose a multi-region transformation strategy to make the network focus on local features, which are more transferable. New state-of-the-art performance is achieved on the point cloud classification UDA benchmark PointDA-10 and point cloud segmentation UDA benchmark PointSegDA. We think that the proposed learnable self-supervised task can be applied in other self-supervised learning and semi-supervised learning studies.

References

  • [1] I. Achituve, H. Maron, and G. Chechik (2021) Self-supervised learning for domain adaptation on point clouds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 123–133. Cited by: §1, §1, §2.2, §2.3, §3.2, §3.2, §3.3, §4.1, §4.1, §4.2, §4.2, §4.3.
  • [2] A. Alliegro, D. Boscaini, and T. Tommasi (2020) Joint supervised and self-supervised learning for 3d real-world challenges. arXiv preprint arXiv:2004.07392. Cited by: §1, §1, §2.2, §2.3, §3.2, §3.2, §3.3, §4.1, §4.3, §4.4.
  • [3] Y. Aoki, H. Goforth, R. A. Srivatsan, and S. Lucey (2019) Pointnetlk: robust & efficient point cloud registration using pointnet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7163–7172. Cited by: §1.
  • [4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H. Kriegel, B. Schölkopf, and A. J. Smola (2006) Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22 (14), pp. e49–e57. Cited by: §1.
  • [5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §4.1.
  • [6] G. Csurka (2017) A comprehensive survey on domain adaptation for visual applications. In Domain adaptation in computer vision applications, pp. 1–35. Cited by: §1.
  • [7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839. Cited by: §4.1.
  • [8] G. Elbaz, T. Avraham, and A. Fischer (2017) 3D point cloud registration for localization using a deep neural network auto-encoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4631–4640. Cited by: §1.
  • [9] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pp. 1180–1189. Cited by: §1.
  • [10] T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and M. Pollefeys (2017) Semantic3d. net: a new large-scale point cloud classification benchmark. arXiv preprint arXiv:1704.03847. Cited by: §1.
  • [11] Z. Han, X. Wang, Y. Liu, and M. Zwicker (2019) Multi-angle point cloud-vae: unsupervised feature learning for 3d point clouds from multiple angles by joint self-reconstruction and half-to-half prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10441–10450. Cited by: §2.3.
  • [12] K. Hassani and M. Haley (2019) Unsupervised multi-task feature learning on point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8160–8171. Cited by: §2.3.
  • [13] D. Im, S. Kang, D. Han, S. Choi, and H. Yoo (2020) A 4.45 ms low-latency 3d point-cloud-based neural network processor for hand pose estimation in immersive wearable devices. In 2020 IEEE Symposium on VLSI Circuits, pp. 1–2. Cited by: §1.
  • [14] L. Jing, Y. Chen, L. Zhang, M. He, and Y. Tian (2020) Self-supervised modal and view invariant feature learning. arXiv preprint arXiv:2005.14169. Cited by: §2.3.
  • [15] L. Jing and Y. Tian (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.3.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [17] W. M. Kouw and M. Loog (2019) A review of domain adaptation without target labels. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1.
  • [18] L. Landrieu and M. Simonovsky (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4558–4567. Cited by: §1.
  • [19] Y. Li, L. Ma, Z. Zhong, F. Liu, M. A. Chapman, D. Cao, and J. Li (2020) Deep learning for lidar point clouds in autonomous driving: a review. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1.
  • [20] S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: §2.2.
  • [21] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037. Cited by: §4.1.
  • [22] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa (2015) Visual domain adaptation: a survey of recent advances. IEEE signal processing magazine 32 (3), pp. 53–69. Cited by: §2.2.
  • [23] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §1, §2.1.
  • [24] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: §2.1.
  • [25] C. Qin, H. You, L. Wang, C. J. Kuo, and Y. Fu (2019) PointDAN: a multi-scale 3d domain adaption network for point cloud representation. In Advances in Neural Information Processing Systems, pp. 7192–7203. Cited by: §1, §2.2, §3.2, §4.1, §4.1, §4.3.
  • [26] K. Saleh, A. Abobakr, M. Attia, J. Iskander, D. Nahavandi, M. Hossny, and S. Nahvandi (2019) Domain adaptation for vehicle detection from bird’s eye view lidar point cloud data. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Cited by: §2.2.
  • [27] J. Sauder and B. Sievers (2019) Self-supervised deep learning on point clouds by reconstructing space. In Advances in Neural Information Processing Systems, pp. 12962–12972. Cited by: §2.3, §4.1, §4.2.
  • [28] Y. Sun, E. Tzeng, T. Darrell, and A. A. Efros (2019) Unsupervised domain adaptation through self-supervision. arXiv preprint arXiv:1909.11825. Cited by: §1, §2.1, §2.2.
  • [29] L. Tang, K. Chen, C. Wu, Y. Hong, K. Jia, and Z. Yang (2020) Improving semantic analysis on point clouds via auxiliary supervision of local geometric priors. arXiv preprint arXiv:2001.04803. Cited by: §2.3.
  • [30] A. Thabet, H. Alwassel, and B. Ghanem (2019) Mortonnet: self-supervised learning of local features in 3d point clouds. arXiv preprint arXiv:1904.00230. Cited by: §2.3.
  • [31] L. Wang, Y. Huang, Y. Hou, S. Zhang, and J. Shan (2019) Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10296–10305. Cited by: §1.
  • [32] M. Wang and W. Deng (2018) Deep visual domain adaptation: a survey. Neurocomputing 312, pp. 135–153. Cited by: §2.2.
  • [33] Y. Wang and J. M. Solomon (2019) Deep closest point: learning representations for point cloud registration. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3523–3532. Cited by: §1.
  • [34] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §2.1.
  • [35] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer (2019) Squeezesegv2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In 2019 International Conference on Robotics and Automation, pp. 4376–4382. Cited by: §2.2.
  • [36] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §4.1.
  • [37] C. Xu, B. Wu, Z. Wang, W. Zhan, P. Vajda, K. Keutzer, and M. Tomizuka (2020) Squeezesegv3: spatially-adaptive convolution for efficient point-cloud segmentation. arXiv preprint arXiv:2004.01803. Cited by: §1.
  • [38] J. Xu, L. Xiao, and A. M. López (2019) Self-supervised domain adaptation for computer vision tasks. IEEE Access 7, pp. 156694–156706. Cited by: §1, §2.2.
  • [39] Z. J. Yew and G. H. Lee (2020) RPM-net: robust point matching using learned features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11824–11833. Cited by: §4.1.
  • [40] L. Zhang, P. van Oosterom, and H. Liu (2020) Visualization of point cloud models in mobile augmented reality using continuous level of detail method. The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences 44, pp. 167–170. Cited by: §1.
  • [41] L. Zhang and Z. Zhu (2019) Unsupervised feature learning for point cloud by contrasting and clustering with graph convolutional neural network. arXiv preprint arXiv:1904.12359. Cited by: §2.3.