Three-dimensional scene understanding is required in numerous applications, in particular robotics, autonomous driving and virtual reality. Among these tasks, 3D semantic segmentation is gaining more and more traction as new datasets are released [7, 1, 5]. Like other perception tasks, 3D semantic segmentation can suffer from domain shift between supervised training and test time, for example between day and night, different countries or datasets. Domain adaptation aims to address this gap, but existing work mostly concerns 2D semantic segmentation [12, 34, 27, 17] and rarely 3D. We also observe that previous domain adaptation work focuses on a single modality, whereas 3D datasets are often multi-modal, consisting of 3D point clouds and 2D images. While the complementarity between these two modalities is already exploited by both human annotators and learned models to localize objects in 3D scenes [18, 20], we consider it from a new angle, asking: if 3D and 2D data are available in the source and target domains, can we capitalize on multi-modality to address Unsupervised Domain Adaptation (UDA)?
We coin our method Cross-Modal UDA, ‘xMUDA’ in short, and consider three real-to-real adaptation scenarios with different lighting conditions (day-to-night), environments (country-to-country) and sensor setups (dataset-to-dataset). The task is challenging for several reasons. The heterogeneous input spaces (2D and 3D) make the pipeline complex, as it requires working with heterogeneous network architectures and 2D-3D projections. In fusion, when two sensors register the same scene, there is information shared between both, but each sensor also carries private (or exclusive) information. Thanks to the latter, one modality can be stronger than the other in one case, but weaker in another, depending on class, context, resolution, etc. This makes selecting the “best” sensor based on prior knowledge unfeasible. Additionally, each modality can be affected differently by the domain shift. For example, the camera is deeply impacted by the day-to-night domain change, while LiDAR is relatively robust to it, as shown on the left of Fig. 1.
To address these challenges, we propose the Cross-Modal UDA (‘xMUDA’) framework, where information can be exchanged between 2D and 3D so that each modality learns from the other for UDA (see right side of Fig. 1). We use a disentangled two-stream architecture to address the domain gap individually in each modality. Our learning scheme allows robust balancing of the Cross-Modal and segmentation objectives. In addition, xMUDA is complementary to self-training with pseudo-labels, a popular UDA technique, as it exploits a different source of knowledge. Finally, it is common practice in supervised learning to use feature fusion (e.g., early or late fusion) when multiple modalities are available [9, 26, 18]: our framework can be extended to fusion while maintaining a disentangled Cross-Modal objective.
Our contributions can be summarized as follows:
We define new UDA scenarios and propose corresponding splits on recently published 2D-3D datasets.
We design an architecture that enables Cross-Modal learning by disentangling private and shared information in 2D and 3D.
We propose a novel UDA learning scheme where modalities can learn from each other in balance with the main objective. It can be applied on top of state-of-the-art self-training techniques to boost performance.
We showcase how our framework can be extended to late fusion and produce superior results.
On the different proposed benchmarks we outperform the single-modality state-of-the-art UDA techniques by a significant margin. Thereby, we show that the exploitation of multi-modality for UDA is a powerful tool that can benefit a wide range of multi-sensor applications.
2 Related Work
In this section, rather than thoroughly going through the literature, we review representative works for each focus.
Unsupervised Domain Adaptation.
The past few years have seen an increasing interest in unsupervised domain adaptation techniques for complex perception tasks like object detection and semantic segmentation. Such methods share the same spirit of learning domain-invariant representations, i.e., features coming from different domains should exhibit insignificant discrepancy. Some works promote adversarial training to minimize the source-target distribution shift, either in pixel, feature or output space [25, 27]. Revisited from semi-supervised learning, self-training with pseudo-labels has also recently proven effective for UDA [17, 34].
While most existing works consider UDA in the 2D world, very few tackle the 3D counterpart. Wu et al. adopted activation correlation alignment for UDA in 3D segmentation from LiDAR point clouds. In this work, we investigate the same task, but differently: our system operates on multi-modal input data, i.e., RGB + LiDAR.
To the best of our knowledge, there is no previous UDA work on 2D/3D semantic segmentation for multi-modal scenarios. Some consider an extra modality, e.g. depth, solely available at training time in the source domain, and leverage such privileged information to boost adaptation performance [16, 28]. In contrast, we assume here that all modalities are available at train and test time in both source and target domains.
In a supervised setting, performance can naturally be improved by fusing features from multiple sources. The geometrically simplest case is RGB-Depth fusion with dense pixel-to-pixel correspondence for 2D segmentation [9, 26]. It is harder to fuse a 3D point cloud with a 2D image, because they live in different metric spaces. One solution is to project 2D and 3D features into a ‘bird’s eye view’ for object detection. Another possibility is to lift 2D features from multi-view images to the 3D point cloud to enable joint 2D-3D processing for 3D semantic segmentation [23, 14, 3]. We are closest to the last series of works, sharing the goal of 3D semantic segmentation. However, we focus on how to exploit multi-modality for UDA instead of supervised learning, and only use single-view images and their corresponding point clouds.
3D networks for semantic segmentation.
While images are dense tensors, 3D point clouds can be represented in multiple ways, which leads to competing network families that evolve in parallel. Voxels are very similar to pixels, but very memory-intensive as most of them are empty. Graham et al. and similar implementations address this problem by using hash tables to convolve only on active voxels. This allows for very high resolution, with typically only one point per voxel. Point-based networks perform computation in continuous 3D space and can thus directly accept point clouds as input. PointNet++ uses point-wise convolution, max-pooling to compute global features, and local neighborhood aggregation for hierarchical learning akin to CNNs. Many improvements have been proposed in this direction, such as continuous convolutions and deformable kernels. Graph-based networks convolve on the edges of a point cloud. In this work, we select SparseConvNet as our 3D network, as it is state-of-the-art on the ScanNet benchmark.
3 xMUDA

The aim of Cross-Modal UDA (xMUDA) is to exploit multi-modality by enabling controlled information exchange between modalities so that they can learn from each other. This is achieved by letting them mutually mimic each other’s outputs, so that each can benefit from its counterpart’s strengths.
Specifically, we investigate xMUDA using point cloud (3D modality) and image (2D modality) on the task of 3D semantic segmentation. An overview is depicted in Fig. 2. We first describe the architecture in Sec. 3.1, our learning scheme in Sec. 3.2, and later showcase its extension to the special case of fusion.
In the following, we consider a source dataset $\mathcal{S}$, where each sample consists of a 2D image $x^{\mathrm{2D}}_s$, a 3D point cloud $x^{\mathrm{3D}}_s$ and 3D segmentation labels $y^{\mathrm{3D}}_s$, as well as a target dataset $\mathcal{T}$, lacking annotations, where each sample only consists of an image $x^{\mathrm{2D}}_t$ and a point cloud $x^{\mathrm{3D}}_t$. Images are of spatial size $H \times W \times 3$ and point clouds of size $N \times 3$, where $N$ is the number of 3D points within the camera field of view.
3.1 Architecture

To allow Cross-Modal learning, it is crucial to extract features specific to each modality. As opposed to 2D-3D architectures where 2D features are lifted to 3D, we use a two-stream architecture with independent 2D and 3D branches that do not share features (see Fig. 2).
We use SparseConvNet for 3D and a modified version of U-Net with a ResNet34 encoder for 2D. Even though each stream has a specific network architecture, it is important that the outputs are of the same size to allow Cross-Modal learning. Implementation details are provided in Sec. 4.2.
Dual Segmentation Head.
We call segmentation head the last linear layer in the network that transforms the output features into logits followed by a softmax function to produce the class probabilities. For xMUDA, we establish a link between 2D and 3D with a ‘mimicry’ loss between the output probabilities, i.e., each modality should predict the other modality’s output. This allows us to explicitly control the Cross-Modal learning.
In a naive approach, each modality has a single segmentation head and a Cross-Modal optimization objective aligns the outputs of both modalities. Unfortunately, this leads to only using information that is shared between the two modalities, while discarding private information that is exclusive to each sensor (more details in the ablation study in Sec. 5.1). This is an important limitation, as we want to leverage both private and shared information, in order to obtain the best possible performance.
To preserve private information while benefiting from shared knowledge, we introduce an additional segmentation head to uncouple the mimicry objective from the main segmentation objective. This means that the 2D and 3D streams both have two segmentation heads: one main head for the best possible prediction, and one mimicry head to estimate the other modality’s output.
The outputs of the 4 segmentation heads (see Fig. 2) are of size $N \times C$, where $C$ is the number of classes, such that we obtain a vector of class probabilities for each 3D point. The two main heads produce the best possible predictions, $P^{\mathrm{2D}}$ and $P^{\mathrm{3D}}$ respectively for each branch. The two mimicry heads estimate the other modality’s output: 2D estimates 3D ($P^{\mathrm{2D}\rightarrow \mathrm{3D}}$) and 3D estimates 2D ($P^{\mathrm{3D}\rightarrow \mathrm{2D}}$).
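As an illustration, the dual-head design can be sketched in a few lines of PyTorch; the feature dimension and layer shapes below are illustrative assumptions, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn

class DualHead(nn.Module):
    """Sketch of the dual segmentation heads on top of one modality stream.

    `feat_dim` is an assumed feature dimension; in the paper each stream's
    backbone (2D U-Net or 3D SparseConvNet) produces these features.
    """
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.main_head = nn.Linear(feat_dim, num_classes)   # best possible prediction
        self.mimic_head = nn.Linear(feat_dim, num_classes)  # estimates the other modality

    def forward(self, feats: torch.Tensor):
        # feats: (N, feat_dim) point-wise features of one modality
        p_main = torch.softmax(self.main_head(feats), dim=1)
        p_mimic = torch.softmax(self.mimic_head(feats), dim=1)
        return p_main, p_mimic
```

The key point is that the mimicry head is a separate linear layer, so aligning its output with the other modality does not constrain the main head’s prediction.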
3.2 Learning Scheme
The goal of our Cross-Modal learning scheme is to exchange information between the modalities in a controlled manner to teach them to be aware of each other. This auxiliary objective can effectively improve the performance of each modality and does not require any annotations, which enables its use for UDA on the target dataset $\mathcal{T}$. In the following we define the basic supervised learning setup, our Cross-Modal loss $\mathcal{L}_{\mathrm{xM}}$, and the additional pseudo-label learning method. The loss flows are depicted in Fig. 3.
The main goal of 3D segmentation is learned through cross-entropy in a classical supervised fashion on the source data. The segmentation loss for each network stream (2D and 3D) writes:
$$\mathcal{L}_{\mathrm{seg}}(x_s, y^{\mathrm{3D}}_s) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} y^{(n,c)}_s \log P^{(n,c)}_{x},$$
where $x$ is either $x^{\mathrm{2D}}_s$ or $x^{\mathrm{3D}}_s$, and $P_x$ the corresponding main prediction.
The objective of unsupervised learning across modalities is twofold. Firstly, we want to transfer knowledge from one modality to the other on the target dataset. For example, let one modality be sensitive and the other more robust to the domain shift, then the robust modality should teach the sensitive modality the correct class in the target domain where no labels are available. Secondly, we want to design an auxiliary objective on source and target, where the task is to estimate the other modality’s prediction. By mimicking not only the class with maximum probability, but the whole distribution, more information is exchanged, leading to softer labels.
We choose the KL divergence for the Cross-Modal loss and define it as follows:
$$\mathcal{L}_{\mathrm{xM}}(x) = D_{\mathrm{KL}}\big(P \,\|\, Q\big) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} P^{(n,c)} \log \frac{Q^{(n,c)}}{P^{(n,c)}},$$
with $(P, Q) \in \{(P^{\mathrm{2D}}, P^{\mathrm{3D}\rightarrow \mathrm{2D}}), (P^{\mathrm{3D}}, P^{\mathrm{2D}\rightarrow \mathrm{3D}})\}$, where $P$ is the target distribution from the main prediction, which is to be estimated by the mimicking prediction $Q$. This loss is applied on both the source and the target domain, as it does not require ground-truth labels, and is the key to our proposed domain adaptation framework. On source, $\mathcal{L}_{\mathrm{xM}}$ can be seen as an auxiliary mimicry loss in addition to the main segmentation loss $\mathcal{L}_{\mathrm{seg}}$.
The complete optimization objective for each network stream (2D and 3D) combines the segmentation loss on source and the Cross-Modal loss on source and target:
$$\min_{\theta} \; \frac{1}{|\mathcal{S}|}\sum_{x_s \in \mathcal{S}} \Big(\mathcal{L}_{\mathrm{seg}}(x_s, y^{\mathrm{3D}}_s) + \lambda_s \mathcal{L}_{\mathrm{xM}}(x_s)\Big) + \frac{1}{|\mathcal{T}|}\sum_{x_t \in \mathcal{T}} \lambda_t \mathcal{L}_{\mathrm{xM}}(x_t), \quad (4)$$
where $\lambda_s$ and $\lambda_t$ are hyperparameters weighting $\mathcal{L}_{\mathrm{xM}}$ on source and target respectively, and $\theta$ are the network weights of either the 2D or the 3D stream.
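A minimal PyTorch sketch of the Cross-Modal loss and the weighted objective, assuming `(N, C)` probability tensors; detaching the target follows the implementation detail in Sec. 4.2, while `lambda_s`/`lambda_t` are placeholders rather than the tuned values.

```python
import torch
import torch.nn.functional as F

def cross_modal_loss(p_main: torch.Tensor, p_mimic: torch.Tensor) -> torch.Tensor:
    """KL(P || Q) averaged over the N points, with P = main prediction of the
    other modality and Q = mimicking prediction. P is detached so gradients
    only flow into the mimicking network. Shapes: (N, C) class probabilities."""
    target = p_main.detach()
    # F.kl_div expects log-probabilities as its first (input) argument
    return F.kl_div(p_mimic.log(), target, reduction="batchmean")

def total_loss(seg_src, xm_src, xm_tgt, lambda_s=1.0, lambda_t=1.0):
    """Combined objective of Eq. 4 for one network stream; the default
    lambda values here are placeholders, not the paper's hyperparameters."""
    return seg_src + lambda_s * xm_src + lambda_t * xm_tgt
```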
There are parallels between Cross-Modal learning and model distillation, which also adopts the KL divergence as a mimicry loss, but with the goal of transferring knowledge from a large network to a smaller one in a supervised setting. Recently, Zhang et al. introduced Deep Mutual Learning, where an ensemble of uni-modal networks is jointly trained so that its members learn from each other in collaboration. Though our Cross-Modal learning is to some extent of a similar nature, we tackle a different distillation angle, i.e. across modalities (2D/3D), and not in the supervised but in the UDA setting.
Self-training with Pseudo-Labels.
Cross-Modal learning is complementary to the pseudo-labeling strategy used originally in semi-supervised learning and recently in UDA [17, 34]. In detail, once a model has been optimized with Eq. 4, we extract pseudo-labels offline, selecting highly confident labels based on the predicted class probability. Training is then done again from scratch, using the produced pseudo-labels for an additional segmentation loss on the target training set. The optimization problem writes:
$$\min_{\theta} \; \frac{1}{|\mathcal{S}|}\sum_{x_s \in \mathcal{S}} \Big(\mathcal{L}_{\mathrm{seg}}(x_s, y^{\mathrm{3D}}_s) + \lambda_s \mathcal{L}_{\mathrm{xM}}(x_s)\Big) + \frac{1}{|\mathcal{T}|}\sum_{x_t \in \mathcal{T}} \Big(\lambda_{\mathrm{PL}} \mathcal{L}_{\mathrm{seg}}(x_t, \hat{y}^{\mathrm{3D}}) + \lambda_t \mathcal{L}_{\mathrm{xM}}(x_t)\Big), \quad (5)$$
where $\lambda_{\mathrm{PL}}$ weights the pseudo-label segmentation loss and $\hat{y}^{\mathrm{3D}}$ are the pseudo-labels.
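A simplified sketch of offline pseudo-label extraction; a single global confidence threshold is assumed here for brevity, whereas class-balanced selection as in [34] is closer to common practice.

```python
import torch

def extract_pseudo_labels(probs: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Keep only points whose maximum predicted class probability exceeds
    `threshold` (an assumed value). Discarded points get -100, the default
    ignore_index of PyTorch's cross-entropy loss. probs: (N, C)."""
    conf, labels = probs.max(dim=1)
    labels[conf < threshold] = -100  # ignored by the segmentation loss
    return labels
```

During the second training stage, these labels feed the additional target segmentation loss of Eq. 5.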
4 Experiments

4.1 Datasets

To evaluate xMUDA, we identified three real-to-real adaptation scenarios. In the day-to-night case, LiDAR has a small domain gap because it is an active sensing technology: it sends out laser beams that are mostly invariant to lighting conditions. In contrast, the camera has a large domain gap, as its passive sensing suffers from the lack of light sources, leading to drastic changes in object appearance. The second scenario is country-to-country adaptation, where the domain gap can be larger for either LiDAR or camera: for some classes the 3D shape might change more than the visual appearance, or vice versa. The third scenario, dataset-to-dataset, comprises changes in the sensor setup, such as camera optics, but most importantly a higher LiDAR resolution on target. 3D networks are sensitive to varying point cloud density, and the image could help to guide and stabilize adaptation.
We leverage the recently published autonomous driving datasets nuScenes, A2D2 and SemanticKitti, in which LiDAR and camera are synchronized and calibrated, allowing the computation of the projection between a 3D point and its corresponding 2D image pixel. The chosen datasets contain 3D annotations. For simplicity and consistency across datasets, we only use the front camera image and the LiDAR points that project into it.
For nuScenes, the annotations are 3D bounding boxes, and we obtain the point-wise labels for 3D semantic segmentation by assigning the corresponding object label to points lying inside a 3D box; all other points are labeled as background. We use the metadata to generate the splits for two UDA scenarios: Day/Night and USA/Singapore.
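The box-to-point labeling step can be sketched as follows, under the simplifying assumption of axis-aligned boxes; real nuScenes boxes are oriented, so a full implementation must first rotate points into the box frame.

```python
import numpy as np

def label_points_from_boxes(points, boxes, box_labels, background=0):
    """Toy point-wise labeling from 3D boxes.

    points: (N, 3) array, boxes: (B, 6) array of axis-aligned boxes as
    (min_x, min_y, min_z, max_x, max_y, max_z), box_labels: length-B labels.
    Points inside a box inherit its label; the rest become background.
    """
    labels = np.full(len(points), background, dtype=np.int64)
    for box, lbl in zip(boxes, box_labels):
        inside = np.all((points >= box[:3]) & (points <= box[3:]), axis=1)
        labels[inside] = lbl
    return labels
```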
A2D2 and SemanticKitti provide segmentation labels. For UDA, we define 10 shared classes between the two datasets. The LiDAR setup is the main difference: in A2D2, there are three LiDARs with 16 layers each, which generate a rather sparse point cloud, while in SemanticKitti, there is one high-resolution LiDAR with 64 layers.
We provide more details about the data splits in Sec. A.
4.2 Implementation Details
| Method | Day/Night 2D | 3D | softmax avg | USA/Singapore 2D | 3D | softmax avg | A2D2/Sem.Kitti 2D | 3D | softmax avg |
|---|---|---|---|---|---|---|---|---|---|
| Baseline (source only) | 42.2 | 41.2 | 47.8 | 53.4 | 46.5 | 61.3 | 36.0 | 36.6 | 41.8 |
| UDA Baseline (PL) | 43.7 | 45.1 | 48.6 | 55.5 | 51.8 | 61.5 | 37.4 | 44.8 | 47.7 |
| xMUDA w/o PL | 46.2 | 44.2 | 50.0 | 59.3 | 52.0 | 62.7 | 36.8 | 43.3 | 42.9 |
For the 2D network, we use a modified U-Net with a ResNet34 encoder, where we add dropout after the 3rd and 4th layers and initialize with ImageNet pretrained weights provided by PyTorch. In the decoder, each layer consists of a transposed convolution, concatenation with encoder features of the same resolution (skip connection) and another convolution to mix the features. The network takes an image as input and produces an output feature map of equal spatial dimensions, $H \times W \times F_{\mathrm{2D}}$, where $F_{\mathrm{2D}}$ is the number of feature channels. In order to lift the 2D features to 3D, we sample them at the sparse pixel locations where the 3D points project into the feature map, and obtain the final feature matrix of size $N \times F_{\mathrm{2D}}$.
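The feature-lifting step amounts to indexing the 2D feature map at the projected pixel coordinates; the sketch below assumes integer (row, col) coordinates and nearest-neighbour sampling.

```python
import torch

def lift_2d_features(feature_map: torch.Tensor, pixel_uv: torch.Tensor) -> torch.Tensor:
    """Sample the 2D feature map at the pixel locations where the 3D points
    project, yielding an (N, F) point-wise feature matrix.

    feature_map: (F, H, W), pixel_uv: (N, 2) integer (row, col) coordinates.
    """
    rows, cols = pixel_uv[:, 0], pixel_uv[:, 1]
    return feature_map[:, rows, cols].t()  # (N, F)
```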
For SparseConvNet, we leverage the official PyTorch implementation and a U-Net architecture with 6 times downsampling. We use a voxel size of 5 cm, which is small enough to have only one 3D point per voxel.
For data augmentation we employ horizontal flipping and color jitter in 2D, and x-axis flipping, scaling and rotation in 3D. Due to the wide-angle image in SemanticKitti, we randomly crop a fixed-size rectangle along the horizontal image axis to reduce memory during training. Log-smoothed class weights are used in all experiments to address class imbalance. When computing the KL divergence for the Cross-Modal loss in PyTorch, we detach the target variable to only backpropagate in either the 2D or the 3D network. We use a batch size of 8, the Adam optimizer, and an iteration-based learning schedule where the learning rate is divided by 10 at 80k and 90k iterations; the training finishes at 100k iterations. We jointly train the 2D and 3D streams and, at each iteration, accumulate gradients computed on the source and target batches. All trainings fit into a single GPU with 11GB RAM.
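One plausible form of the log-smoothed class weights is sketched below; the exact formula is not spelled out in the text, so both the expression and the smoothing constant are assumptions.

```python
import numpy as np

def log_smoothed_class_weights(class_counts, smoothing=1.0):
    """Assumed log-smoothed weighting for imbalanced segmentation:
    w_c = 1 / log(smoothing + frequency_c), so rare classes get
    larger weights without the extreme values of inverse frequency."""
    counts = np.asarray(class_counts, dtype=np.float64)
    freq = counts / counts.sum()
    return 1.0 / np.log(smoothing + freq)
```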
There are two training stages. First, we train with Eq. 4, applying the segmentation loss with ground-truth labels on source and the Cross-Modal loss on source and target. Once trained, we generate pseudo-labels from the last model. Note that we do not select the best weights on the validation set, but rather use the last checkpoint to generate the pseudo-labels, in order to prevent any supervised learning signal. In the second training, with the objective of Eq. 5, an additional segmentation loss on target using the pseudo-labels is added; the network weights are reinitialized from scratch. The 2D and 3D networks are trained jointly and optimized on source and target at each iteration.
4.3 Main Experiments
We evaluate xMUDA on the three proposed Cross-Modal UDA scenarios and compare against a state-of-the-art uni-modal UDA method.
We report mean Intersection over Union (mIoU) results for 3D segmentation in Tab. 1 on the target test set for the 3 UDA scenarios. We evaluate on the test set using the checkpoint that achieved the best score on the validation set. In addition to the scores of the 2D and 3D model, we show the ensembling result (‘softmax avg’) which is obtained by taking the mean of the predicted 2D and 3D probabilities after softmax. The baseline is trained on source only and the oracle on target only, except the Day/Night oracle, where we used batches of 50%/50% Day/Night to prevent overfitting. PL is applied separately on each modality (2D pseudo-labels to train 2D, 3D pseudo-labels to train 3D).
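The ‘softmax avg’ ensembling can be expressed directly from the two streams’ logits:

```python
import torch

def softmax_average(logits_2d: torch.Tensor, logits_3d: torch.Tensor) -> torch.Tensor:
    """Ensemble the 2D and 3D streams by averaging their class probabilities
    after softmax; the final prediction is the argmax of the mean. Shapes: (N, C)."""
    probs = 0.5 * (torch.softmax(logits_2d, dim=1) + torch.softmax(logits_3d, dim=1))
    return probs.argmax(dim=1)
```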
In all 3 UDA scenarios xMUDA outperforms both normal and UDA baselines significantly which proves the benefit of exchanging information between modalities. We observe that Cross-Modal learning and self-training with Pseudo Labels (PL) are complementary as the best score is always achieved combining both (‘xMUDA’). This is expected as they represent different concepts: uni-modal self-training with PL reinforces confident predictions while Cross-Modal learning enables knowledge sharing between modalities and can be seen as auxiliary task.
We observe that Cross-Modal learning consistently improves both modalities. Thus, even the strong modality can learn from the weaker one, thanks to decoupling of main and mimicking prediction.
Qualitative results are presented in Fig. 6 and show the versatility of xMUDA across all proposed UDA scenarios. Fig. 7 depicts the individual 2D/3D outputs to illustrate their respective strengths and weaknesses, e.g. at night, 3D works much better than 2D. We also provide a video of the A2D2 to SemanticKitti scenario at http://tiny.cc/xmuda.
4.4 Extension to Fusion
In Vanilla Fusion, the 2D and 3D features are concatenated, fed into a linear layer with ReLU to mix the features, followed by another linear layer and softmax to obtain a fused prediction. In xMUDA Fusion, we add two uni-modal outputs that are used to mimic the fusion output.
In Sec. 4.3 we show how each modality can be improved with xMUDA and consequently, the softmax average also increases. However, how can we obtain the best possible results by 2D and 3D feature fusion?
A common fusion architecture is late fusion, where the features from different sources are concatenated (see Fig. (a)). However, for xMUDA we need modality independence in the features, as the mimicking task becomes trivial otherwise. Therefore, we propose xMUDA Fusion (see Fig. (b)), where each modality has a uni-modal prediction output that is used to mimic the fusion prediction.
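A sketch of the xMUDA Fusion head described above; the layer sizes and hidden dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class XMudaFusionHead(nn.Module):
    """Sketch of xMUDA Fusion: concatenated 2D/3D features are mixed into a
    fused prediction, while two uni-modal heads are trained to mimic it.
    Layer sizes are illustrative assumptions, not the paper's configuration."""
    def __init__(self, feat_2d: int, feat_3d: int, num_classes: int, hidden: int = 64):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(feat_2d + feat_3d, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))
        self.mimic_2d = nn.Linear(feat_2d, num_classes)  # uni-modal mimicry heads
        self.mimic_3d = nn.Linear(feat_3d, num_classes)

    def forward(self, f2d: torch.Tensor, f3d: torch.Tensor):
        p_fuse = torch.softmax(self.mix(torch.cat([f2d, f3d], dim=1)), dim=1)
        p_2d = torch.softmax(self.mimic_2d(f2d), dim=1)
        p_3d = torch.softmax(self.mimic_3d(f3d), dim=1)
        return p_fuse, p_2d, p_3d  # p_2d / p_3d are trained to mimic p_fuse
```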
In Tab. 2 we show results for different fusion approaches. ‘xMUDA Fusion w/o PL’ outperforms Vanilla Fusion thanks to Cross-Modal learning. We can improve over ‘Vanilla Fusion + PL’ with ‘Distilled Vanilla Fusion’ where we use the xMUDA model of the main experiments reported in Tab. 1 to generate pseudo-labels from the softmax average and train the Vanilla Fusion network. The best performance can be achieved with xMUDA, combining Cross-Modal learning and PL, analogously to the main experiments.
| Method | mIoU |
|---|---|
| Vanilla Fusion (no UDA) | 59.9 |
| xMUDA Fusion w/o PL | 61.9 |
| Vanilla Fusion + PL | 65.2 |
| Distilled Vanilla Fusion | 65.8 |
5 Ablation Studies
5.1 Segmentation Heads
In the following, we justify our design choice of two segmentation heads per modality stream, as opposed to a single one in a naive approach (see Fig. (a)).
In the single-head architecture, the mimicking objective is directly applied between the two main predictions, which leads to an increase of probability in the weaker modality and a decrease in the stronger one, as can be seen for the vehicle class in Fig. (b). There is shared information between 2D/3D, but also private information in each modality. An unwanted way to reduce the Cross-Modal loss is for the networks to discard private information, so that both only use shared information, making it easier to align their outputs. However, the best performance is obviously achieved when private information is also used. By separating the main from the mimicking prediction with dual segmentation heads, we effectively decouple the two optimization objectives: the main head outputs the best possible prediction to optimize the segmentation loss, while the mimicking head can align with the other modality.
The experiments only include the first training step without PL, as we want to benchmark pure Cross-Modal learning. The results in Tab. 3 show that xMUDA performs much better than the single-head architecture and is also much more robust to the choice of the Cross-Modal loss weight. As the loss weight becomes too large, single-head performance decreases because the network resorts to the trivial solution of predicting the most frequent class to align the 2D/3D outputs, while xMUDA remains robust. Note that we fix the weight optimally for each architecture: 0.1 for single head and 1.0 for xMUDA.
| Method | Single head w/o PL: 2D | 3D | softmax avg | xMUDA w/o PL: 2D | 3D | softmax avg |
5.2 Cross-Modal Learning on Source
In Eq. 4, the Cross-Modal loss is applied on source and target, although we already have a supervised segmentation loss on source. We observe an improvement of 4.8 mIoU on 2D and 4.4 on 3D when adding the Cross-Modal loss on source, as opposed to applying it on target only. This shows that it is important to also train the mimicking head on source, stabilizing the predictions, which can be exploited during adaptation on target.
5.3 Cross-Modal Learning for Oracle Training
We have shown that Cross-Modal learning is very effective for UDA. However, it can also be used in a purely supervised setting. When training the oracle with the Cross-Modal loss, we improve over the baseline (see Tab. 4). We conjecture that the Cross-Modal loss is a beneficial auxiliary loss that helps regularize training and prevent overfitting.
6 Conclusion

We propose xMUDA, Cross-Modal Unsupervised Domain Adaptation, where modalities learn from each other to improve performance on the target domain. For Cross-Modal learning, we introduce mutual mimicking between the modalities, achieved through the KL divergence. We design an architecture with separate main and mimicking heads to disentangle the segmentation from the Cross-Modal learning objective. Experiments on 3D semantic segmentation in new UDA scenarios using 2D/3D datasets show that xMUDA largely outperforms uni-modal UDA and is complementary to the pseudo-label strategy. An analogous performance boost is observed for fusion.
We think that Cross-Modal learning could be useful in a wide variety of settings and tasks, not limited to UDA. Particularly, it should be beneficial for supervised learning and other modalities than image and point cloud.
Appendix A Dataset Splits
A.1 nuScenes

The nuScenes dataset consists of 1000 driving scenes of 20 seconds each, corresponding to 40k annotated keyframes taken at 2Hz. The scenes are split into train (28,130 keyframes), validation (6,019 keyframes) and a hidden test set. The point-wise 3D semantic labels are obtained from the 3D boxes. We propose the following splits for domain adaptation, with the respective source/target domains: Day/Night and Boston/Singapore. To this end, we use the official validation split as test set and divide the training set into train/val for the target set (see Tab. 5 for the number of frames in each split). As the number of object instances in the target split can be very small (e.g. for night), we merge the objects into 5 categories: vehicle (car, truck, bus, trailer, construction vehicle), pedestrian, bike (motorcycle, bicycle), traffic boundary (traffic cone, barrier) and background.
| Scenario | Source train | Source test | Target train | Target val | Target test |
|---|---|---|---|---|---|
| Day - Night | 24,745 | 5,417 | 2,779 | 606 | 602 |
| Boston - Singapore | 15,695 | 3,090 | 9,665 | 2,770 | 2,929 |
| A2D2 - SemanticKitti | 27,695 | 942 | 18,029 | 1,101 | 4,071 |
A.2 A2D2 and SemanticKitti
The A2D2 dataset features 20 drives, corresponding to 28,637 frames. The point cloud comes from three 16-layer front LiDARs (left, center, right), where the left and right front LiDARs are inclined. The semantic labeling was carried out in the 2D image for 38 classes, and we compute the 3D labels by projecting the point cloud into the labeled image. We keep scene 20180807_145028 as test set and use the rest for training.
The SemanticKitti dataset provides 3D point cloud labels for the Odometry dataset of Kitti, which features a wide-angle front camera and a 64-layer LiDAR. The annotation of the 28 classes was carried out directly in 3D. We divide the scenes into train, validation and test sets.
We select 10 shared classes between the 2 datasets by merging or ignoring them (see Tab. 6). The 10 final classes are car, truck, bike, person, road, parking, sidewalk, building, nature, other-objects.
References

[1] Behley et al. (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In ICCV.
[2] Caesar et al. (2019) nuScenes: a multimodal dataset for autonomous driving. arXiv:1903.11027.
[3] Chiang et al. (2019) A unified point-based framework for 3D segmentation. In 3DV.
[4] Choy et al. (2019) 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In CVPR.
[5] Dai et al. (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In CVPR.
[6] Geiger et al. (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR.
[7] Geyer et al. (2019) A2D2: AEV autonomous driving dataset. Audi Electronics Venture GmbH. http://www.a2d2.audi
[8] Graham et al. (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR.
[9] Hazirbas et al. (2016) FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture. In ACCV.
[10] He et al. (2016) Deep residual learning for image recognition. In CVPR.
[11] Hinton et al. (2014) Distilling the knowledge in a neural network. In NIPS Workshops.
[12] Hoffman et al. (2018) CyCADA: cycle-consistent adversarial domain adaptation. In ICML.
[13] Hoffman et al. (2016) FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv:1612.02649.
[14] Jaritz et al. (2019) Multi-view PointNet for 3D scene understanding. In ICCV Workshops.
[15] Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshops.
[16] Lee et al. (2019) SPIGAN: privileged adversarial learning from simulation. In ICLR.
[17] Li et al. (2019) Bidirectional learning for domain adaptation of semantic segmentation. In CVPR.
[18] Liang et al. (2018) Deep continuous fusion for multi-sensor 3D object detection. In ECCV.
[19] Morerio et al. (2018) Minimal-entropy correlation alignment for unsupervised deep domain adaptation. In ICLR.
[20] Qi et al. (2018) Frustum PointNets for 3D object detection from RGB-D data. In CVPR.
[21] Qi et al. (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS.
[22] Ronneberger et al. (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI.
[23] Su et al. (2018) SPLATNet: sparse lattice networks for point cloud processing. In CVPR.
[24] Thomas et al. (2019) KPConv: flexible and deformable convolution for point clouds. In ICCV.
[25] Tsai et al. (2018) Learning to adapt structured output space for semantic segmentation. In CVPR.
[26] Valada et al. (2019) Self-supervised model adaptation for multimodal semantic segmentation. IJCV.
[27] Vu et al. (2019) ADVENT: adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR.
[28] Vu et al. (2019) DADA: depth-aware domain adaptation in semantic segmentation. In ICCV.
[29] Wang et al. (2018) Deep parametric continuous convolutional neural networks. In CVPR.
[30] Wang et al. (2019) Dynamic graph CNN for learning on point clouds. ACM TOG.
[31] Wu et al. (2018) SqueezeSeg: convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud. In ICRA.
[32] Wu et al. (2019) SqueezeSegV2: improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud. In ICRA.
[33] Zhang et al. (2018) Deep mutual learning. In CVPR.
[34] Zou et al. (2019) Confidence regularized self-training. In ICCV.