Context Prediction for Unsupervised Deep Learning on Point Clouds

01/24/2019 ∙ by Jonathan Sauder, et al. ∙ 0

Point clouds provide a flexible and natural representation usable in countless applications such as robotics or self-driving cars. Recently, deep neural networks operating on raw point cloud data have shown promising results on supervised learning tasks such as object classification and semantic segmentation. While massive point cloud datasets can be captured using modern scanning technology, manually labelling such large 3D point clouds for supervised learning tasks is a cumbersome process. This necessitates effective unsupervised learning methods that can produce representations such that downstream tasks require significantly fewer annotated samples. We propose a novel method for unsupervised learning on raw point cloud data in which a neural network is trained to predict the spatial relationship between two point cloud segments. While solving this task, representations that capture semantic properties of the point cloud are learned. Our method outperforms previous unsupervised learning approaches in downstream object classification and segmentation tasks and performs on par with fully supervised methods.




1 Introduction

Point clouds provide a natural and flexible representation of objects in metric spaces. They can also be easily captured by modern scanning devices and techniques. Algorithms that can extract semantic information from point clouds are crucial to countless applications such as robotics and self-driving cars. Traditionally, systems for such tasks have relied on the approximate computation of geometric features such as faces, edges or corners (Van Kaick et al., 2011; Guo et al., 2014) and hand-crafted features encoding statistical properties (Aubry et al., 2011; Rusu et al., 2008). However, these approaches are often tailored to specific tasks, thus not providing the necessary flexibility for modern applications. Recently, Convolutional Neural Networks (CNNs), which are domain-independent, have shown promising performance on point clouds in supervised learning tasks such as object classification and semantic segmentation (Qi et al., 2017a, b; Wang et al., 2018; Li et al., 2018), outperforming conventional approaches.

Figure 1: An illustration of the unsupervised context prediction task. A neural network is trained to predict the spatial relation between two point cloud segments. In this example, the correct relation of the two airplane parts is “the red segment is diagonally above the blue segment”.

The advent of scalable 3D point cloud scanning technologies such as LiDAR scanners and stereo cameras gives rise to massive point cloud datasets, possibly spanning large entities such as entire cities or regions. However, manually annotating such massive amounts of data for supervised learning tasks such as semantic segmentation is difficult: typical real-world point clouds reach billions of points and petabytes of data, while user interfaces for labelling 3D data (e.g. drawing bounding boxes) are inherently limited by 2D screens. It is therefore of great interest to develop unsupervised methods that reduce the number of annotated samples needed for strong performance on downstream tasks.

Deep neural networks have been shown to learn powerful hierarchical representations in unsupervised settings across multiple domains (Lee et al., 2009; Mikolov et al., 2013). Such methods can generally be classified into methods that learn a mapping of samples to latent space, such as Autoencoders (Hinton & Salakhutdinov, 2006) or Generative Adversarial Nets (GANs) (Goodfellow et al., 2014), and methods in which representations are learnt from a common context between objects. Methods that learn from context have been famously applied in natural language processing in the form of word embeddings (Mikolov et al., 2013) and have also been introduced to images (Doersch et al., 2015). On point clouds, previous unsupervised deep learning has focused on GANs and Autoencoders.

In this work we present an unsupervised learning method based on context prediction operating on raw point cloud data. We train a neural network to predict the spatial relationship between two point cloud segments. Solving this task requires learning representations which encode the semantics of the point cloud segments. An example is given in Figure 1. Our method circumvents the need for computationally expensive similarity measures on point clouds and provides flexibility in specific network architecture design. Various neural network models that operate on raw point clouds and have been successfully applied to supervised tasks can be used as a drop-in component in our setup. As such networks learn global (scene-level) and local (per-point) representations, our method is also able to learn per-point features in an unsupervised manner.

In a series of experiments, we show that our method learns powerful representations of point clouds. Our method outperforms previous unsupervised methods in a downstream object classification task in a transfer learning setting. We also explore per-point features and show that a simple Multi-Layer Perceptron (MLP) trained on the frozen embeddings obtained from unsupervised training reaches competitive performance with fully supervised approaches. We highlight our main contributions as:

  • We present a flexible unsupervised learning method operating on raw point clouds in which a neural network is trained to predict the spatial context between two point cloud segments. Our method avoids computationally expensive and possibly flawed reconstruction losses or similarity metrics on point clouds.

  • We demonstrate the quality of the obtained embeddings: our method outperforms state-of-the-art unsupervised methods in a downstream object classification task.

  • To the best of our knowledge, we are the first to explore per-point unsupervised embeddings in part segmentation and semantic segmentation tasks. A simple classifier trained on the frozen point embeddings obtained through unsupervised training performs on par with fully supervised approaches.

2 Related Work

2.1 Deep Learning on Point Clouds

Deep neural networks have shown impressive performance on regularly structured data representations such as images and time series. However, point clouds are unordered sets of vectors and therefore exemplify a class of problems that pose challenges for deep learning, for which the term geometric deep learning (Bronstein et al., 2017) has been coined. Although deep learning methods for unordered sets (Vinyals et al., 2015; Zaheer et al., 2017) have been proposed and also applied to point clouds (Ravanbakhsh et al., 2016), these approaches do not leverage spatial structure.

To address this problem, popular point cloud representations suitable for deep learning include volumetric approaches (Maturana & Scherer, 2015), in which the containing space is voxelized to be suitable for 3D CNNs, and multi-view approaches (Savva et al., 2017), in which 3D point clouds are rendered into 2D images that are fed into 2D CNNs. However, voxelized representations can be difficult to use when the point cloud density varies, and as such are constrained by the resolution and limited by the computational cost of 3D convolutions. Although multi-view approaches have shown strong performance in the classification of standalone objects, it is unclear how to extend them to work reliably in larger scenes (e.g. with occluded objects) and on per-point tasks such as part segmentation (Qi et al., 2017a).

A more recent approach, pioneered by PointNet (Qi et al., 2017a), is feeding raw point cloud data into neural networks. As point clouds are unordered sets, such networks have to be permutation invariant. PointNet achieves this by using the max-pooling operation to form a single feature vector representing the global context from a variable number of points. PointNet++ (Qi et al., 2017b) proposes an extension that introduces local context by stacking multiple PointNet layers. Inspired by advancements in deep learning on graphs (Masci et al., 2015; Yi et al., 2017), further improvements were made by introducing Dynamic Graph CNNs (DGCNNs) (Wang et al., 2018), in which a graph convolution is executed on the edges of the k-nearest neighbor graph of the point cloud, which is dynamically recomputed in feature space after each layer. Even better performance was achieved by PointCNN (Li et al., 2018), which uses a hierarchical convolution that is trained to learn permutation invariance. All neural networks operating on raw point cloud data naturally provide per-point embeddings, making them particularly useful for point segmentation tasks. Our proposed method can leverage these methods as it is flexible with regard to the specific neural network architecture used.
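The permutation invariance that PointNet obtains from max-pooling can be illustrated with a minimal sketch. The feature values and the helper name `max_pool` are purely illustrative and not taken from any of the cited implementations.

```python
import random


def max_pool(point_features):
    """Element-wise maximum over a list of equal-length per-point feature vectors."""
    return [max(column) for column in zip(*point_features)]


# Toy per-point features (4 points, 3 channels); the values are made up.
features = [
    [0.1, 2.0, -1.0],
    [0.7, 0.3, 0.5],
    [-0.2, 1.5, 0.9],
    [0.4, 0.0, 0.2],
]

# Present the same points to the pooling step in a different order.
shuffled = features[:]
random.Random(0).shuffle(shuffled)

global_feature = max_pool(features)
# The pooled vector does not depend on the order of the points.
assert global_feature == max_pool(shuffled)
```

Because the maximum is taken per channel over all points, the pooled vector has a fixed size regardless of how many points the input contains.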

Figure 2: The architecture for context prediction on point clouds. The model receives two input point cloud segments which stand in a certain spatial relation to each other and is trained to predict the correct spatial relation label. The choice of intermediate model is flexible: any neural network usable for prediction tasks on point clouds can be used instead of a DGCNN. Per-point embeddings are obtained before the max-pooling layers; per-segment embeddings are obtained after max-pooling.

2.2 Unsupervised Deep Learning

Deep learning algorithms have demonstrated the ability to learn powerful internal hierarchical embeddings through unsupervised learning tasks (Lee et al., 2009). These representations can be directly used in downstream tasks or as strong initializers for supervised tasks (Mikolov et al., 2013; Erhan et al., 2010). In cases where large amounts of data are available but annotated samples are scarce, unsupervised learning can significantly reduce the number of annotated training samples required for strong performance in various tasks (Yang et al., 2018), making effective unsupervised methods for point clouds particularly desirable. Unsupervised methods for deep learning can be loosely categorized into methods that learn a mapping between samples and a latent space such as Generative Adversarial Nets (Goodfellow et al., 2014) and Autoencoders (Hinton & Salakhutdinov, 2006), and methods that learn through the common context between a set of objects such as unsupervised word embeddings (Mikolov et al., 2013).

GANs have shown impressive results in the image domain (Brock et al., 2018), and have also been applied to point clouds (Wu et al., 2016). However, such approaches work on volumetric point cloud representations, as GANs rely on the ability to generate artificial samples, which is non-trivial for unordered sets such as raw point cloud data. A volumetric Autoencoder for point clouds titled VConv-DAE (Sharma et al., 2016) has also been proposed, as well as an approach that trains a GAN on the latent space mapping obtained through training a separate Autoencoder (Achlioptas et al., 2017). The quality of the embeddings produced by these approaches is conventionally evaluated in object classification as a downstream task, where a linear SVM is trained on the embeddings obtained from unsupervised training on a larger, unlabelled dataset. The best performance thus far on such a task is achieved by FoldingNet (Yang et al., 2018), an Autoencoder that learns to fold a regular 2D grid around a point cloud.

All mentioned unsupervised methods for point clouds rely on similarity metrics such as the Earth Mover’s Distance (EMD) (Rubner et al., 2000), for which the computational cost is infeasible, or the closely related differentiable Chamfer (pseudo) distance. The Chamfer distance between two point clouds accumulates, for each point, the distance to the closest point in the other set.
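One common form of the Chamfer distance between two point sets $S_1$ and $S_2$ is $d_{CH}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2$. The squared Euclidean norm and the unnormalised sums are assumptions made for this sketch; other variants average the terms or take their maximum. A brute-force version could look as follows.

```python
def chamfer_distance(s1, s2):
    """Brute-force Chamfer (pseudo) distance between two point lists.

    Assumes squared Euclidean distances and unnormalised sums; this is one
    common variant, not necessarily the one used by the cited methods.
    """

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def one_sided(src, dst):
        # For every source point, add the distance to its nearest neighbour.
        return sum(min(sq_dist(p, q) for q in dst) for p in src)

    return one_sided(s1, s2) + one_sided(s2, s1)
```

This direct formulation compares every point with every point of the other set, so its cost grows with the product of the two set sizes.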

The Chamfer distance can also be inefficient to compute for large point clouds, but more importantly, Achlioptas et al. (2017) observe that specific pathological cases are handled incorrectly. This motivates unsupervised methods such as ours, which avoid such potentially problematic similarity functions.

While approaches such as GANs and Autoencoders are extremely popular in the image domain (possibly due to their generative ability), a completely different approach to unsupervised learning for images is taken by (Doersch et al., 2015). They split an image into patches and train a neural network to predict how two patches from the same image correspond, demonstrating that the learned features can be a powerful initialization for supervised training and that the learned embeddings successfully capture semantics. We build on this approach and adapt the idea of spatial context prediction to point clouds, which have certain characteristics that make them particularly well suited for such a task. The authors (Doersch et al., 2015) argue that such a context prediction task addresses the problem that predicting pixels is more difficult than predicting words, as a large variety of pixels can arise from the same semantic object. This holds even more true when moving from images, i.e. regular grids in 2D space, to unordered sets in 3D space. Furthermore, context prediction phrases the unsupervised learning problem as a classification task, therefore avoiding possibly problematic point cloud similarity metrics.

3 Method

3.1 Learning Representations from Context

In this paper we propose a method that learns high-level representations from context in point clouds in an unsupervised fashion. Our method works by training a neural network to predict the spatial relation between two point cloud segments. While learning to solve this unsupervised task, representations that capture the semantics of the point cloud are learned. We extend the concept of context prediction proposed by (Doersch et al., 2015) from images to point clouds, as point clouds have particularly desirable properties for such a task.

In general, a context gives rise to the semantics of a single entity. For example, a sentence puts a word in context and an image puts a pixel in context. Similarly, the context of a point in space is determined by the surrounding point cloud. From a context it is possible to extract relations, i.e. semantic relationships between two entities. In point clouds, an example of such a relation between two segments could be “A is part of B”. Other useful relations can be data-dependent, e.g. “A is B in summer” could relate a 3D-scanned leafless tree in wintertime to its green summer counterpart through a temporal dimension. Spatial relations are innate to point clouds in Euclidean space, e.g. “A is above B” or “A is B when rotated and shifted”, and can be used independently of domain and data. We propose a deep neural network architecture which is flexible with regard to the set of possible relations to learn from.

An overview of our architecture is shown in Figure 2. The input is given by two point cloud segments standing in a relation to each other. The learning objective is a classification task in which each class represents a possible relation. This can be done by minimizing the multi-label cross entropy, without needing to compute any point cloud similarity metrics. The weights of the network taking the variable-sized point cloud as inputs and producing a fixed size feature vector are shared between the two point cloud inputs. The exact choice of this network is flexible: while models operating on raw point clouds, such as PointNet or DGCNN, are particularly interesting as they produce per-point embeddings and are robust to varying densities, volumetric or multi-view networks could also be used. In fact, future work proposing better-performing networks for fully supervised tasks should translate directly into better performance within a context prediction task.
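The weight sharing and the classification objective described above can be sketched in a few lines. This is a toy stand-in, not the CloudContext model: the single-layer encoder, the concatenation of the two segment embeddings, the embedding size `EMB`, and all function names are assumptions made for illustration.

```python
import math
import random


def encode(points, weights):
    """Shared toy encoder: per-point linear layer + ReLU, then max-pool over points."""
    per_point = [
        [max(0.0, sum(w * c for w, c in zip(row, p))) for row in weights]
        for p in points
    ]
    return [max(column) for column in zip(*per_point)]


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample with an integer class label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def relation_logits(seg_a, seg_b, enc_w, cls_w):
    # The SAME encoder weights are applied to both segments (weight sharing).
    joint = encode(seg_a, enc_w) + encode(seg_b, enc_w)  # assumed: concatenation
    return [sum(w * f for w, f in zip(row, joint)) for row in cls_w]


rng = random.Random(0)
NUM_RELATIONS, EMB = 10, 8  # ten relation classes; EMB is an illustrative size
enc_w = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(EMB)]
cls_w = [[rng.gauss(0, 1) for _ in range(2 * EMB)] for _ in range(NUM_RELATIONS)]

# Segments may contain different numbers of points.
seg_a = [(rng.random(), rng.random(), rng.random()) for _ in range(32)]
seg_b = [(rng.random(), rng.random(), rng.random()) for _ in range(50)]

logits = relation_logits(seg_a, seg_b, enc_w, cls_w)
loss = cross_entropy(logits, target=3)  # class 3 is an arbitrary example label
```

The loss depends only on the predicted relation logits and the relation label, so no distance between point clouds has to be evaluated during training.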

3.2 Spatial Context in Point Clouds

In our experiments we focus exclusively on spatial relations, making our method universally applicable to any point cloud. An exhaustive analysis of possible relations which can lead to better results in specific tasks is out of scope for this paper. Instead, we empirically show that even with only simple spatial relations, our proposed method can achieve good results. While in images, spatial relations are limited to selecting patches from different locations in the image (Doersch et al., 2015), spatial relations in point clouds naturally provide more flexibility as they are not constrained by a discrete grid, e.g. allowing for rotation and scaling.

In all experiments we use the same set of ten relations. Specifically, we include relations over the height axis, i.e. “A is above B” or “A is below B”. To avoid the trivial solution of simply comparing height values, we scale each input segment to the unit ball, which discards global height values but preserves height values within each segment. We treat relations on the non-height axes as equal, i.e. there is no distinction between “behind” and “beside”, as objects are not always facing in the same direction. We do, however, distinguish between “A is next to B” and “A is diagonally next to B”. Combining these, e.g. “A is diagonally below B”, leads to eight spatial relations. Furthermore, we include the relations “A is B rotated”, realized by applying a random rotation, and “A is not related to B”, realized by choosing two random point cloud segments from the entire training dataset.
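As a rough sketch of the two ingredients above, the following code scales a segment to the unit ball and enumerates one label set that is consistent with the stated counts (eight spatial relations plus two extra ones). Centring at the centroid and the specific 3 × 3 enumeration are assumptions; the text does not spell out either.

```python
import math
from itertools import product


def scale_to_unit_ball(points):
    """Centre a segment at its centroid and scale the farthest point to norm 1."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in points]
    radius = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    if radius == 0:
        return centered
    return [[c / radius for c in p] for p in centered]


# Hypothetical enumeration: three vertical offsets times three horizontal
# offsets, minus the case in which both segments occupy the same position.
VERTICAL = ("below", "level", "above")
HORIZONTAL = ("same column", "next to", "diagonally next to")
SPATIAL_RELATIONS = [
    (v, h) for v, h in product(VERTICAL, HORIZONTAL)
    if (v, h) != ("level", "same column")
]
ALL_RELATIONS = SPATIAL_RELATIONS + [("rotated",), ("unrelated",)]
```

After scaling, a segment taken from the top of an object and the same segment shifted downwards are indistinguishable, so the network cannot solve the task by comparing absolute heights.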

3.3 Model

In all experiments we use the same model architecture in the unsupervised learning task (full code is provided in the supplementary material and will be made publicly available upon publication). We refer to our model as CloudContext. It contains a simplified version of DGCNN (Wang et al., 2018), as it provides a reasonable trade-off between performance and computational complexity. The simplified version uses no spatial transform network and consists of four EdgeConv layers on the 20-nearest neighbor graph with filter sizes 64, 64, 64, 128 and a last convolution with filter size 256 that concatenates the previous layers through skip connections. This leads to a per-point and per-segment embedding of size 256. Two dense layers of size 1024 and 512 follow before the final linear layer, which maps to the number of relation class labels. We use ReLU activations and batch normalization on all layers, and dropout on the two dense layers. We use Adam as optimizer with a fixed learning rate of 0.0001. To improve generalization, we augment the input point clouds. The entire point cloud is scaled to the unit ball before being rotated around the height axis by a random multiple of 90 degrees. Then each spatial dimension is independently scaled by a randomly drawn factor, the entire point cloud is shifted by a randomly drawn vector, and each point is jittered by adding a randomly drawn vector.
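The augmentation steps can be sketched as follows for a point cloud that has already been scaled to the unit ball. The choice of uniform scaling factors and Gaussian shift and jitter, as well as every numeric default (`scale_range`, `shift_std`, `jitter_std`), are placeholders chosen for illustration and are not values from the paper.

```python
import random


def rotate_about_height(points, k):
    """Rotate by k * 90 degrees about the z (height) axis, free of trig round-off."""
    for _ in range(k % 4):
        points = [(-y, x, z) for x, y, z in points]
    return points


def augment(points, rng, scale_range=(0.9, 1.1), shift_std=0.05, jitter_std=0.01):
    """Rotate, scale per axis, shift, and jitter a point cloud.

    NOTE: the distributions and default parameters are hypothetical.
    """
    points = rotate_about_height(points, rng.randrange(4))
    scale = [rng.uniform(*scale_range) for _ in range(3)]  # one factor per axis
    shift = [rng.gauss(0.0, shift_std) for _ in range(3)]  # one vector per cloud
    return [
        tuple(p[i] * scale[i] + shift[i] + rng.gauss(0.0, jitter_std) for i in range(3))
        for p in points  # the jitter term is drawn independently per point
    ]
```

A call such as `augment(cloud, random.Random(0))` returns a new list of the same length and leaves the input untouched.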

4 Experimental Results

4.1 Object Classification

In this section, we show that the embeddings learned with our method, referred to as CloudContext, perform strongly in object classification as a downstream task and show that we outperform state-of-the-art unsupervised methods before comparing our method to fully supervised approaches.

In line with previous approaches, we evaluate our performance on the object classification problem using the ModelNet dataset (Wu et al., 2014), which contains CAD models from different categories of man-made objects. For this we use the standard train/test split, with the same uniform point sample as defined in (Qi et al., 2017a) with ModelNet40 on 40 classes containing 9843 train and 2468 test models and ModelNet10 on ten classes containing 3991 and 909 models respectively.

To obtain object-level embeddings from our method, we combine all segment-level embeddings that make up a single object into a single embedding through concatenation. For the first experiment, we train a SVM as in (Achlioptas et al., 2017; Wu et al., 2016; Yang et al., 2018) with representations obtained using a dataset different from ModelNet: The ShapeNet dataset (Chang et al., 2015) consists of 57448 models from 55 categories. From every model we use only a random sample of 2048 points as provided by (Yang et al., 2018). In the second experiment, comparing to end-to-end supervised methods, our method still proceeds unsupervised as in the first experiment, but with embeddings learned using ModelNet instead of ShapeNet in order to have only exactly the same information available as the supervised methods.

| Model | MN40 | MN10 |
| --- | --- | --- |
| VConv-DAE | 75.5% | 80.5% |
| 3D-GAN | 83.3% | 91.0% |
| Latent-GAN | 85.7% | 95.3% |
| FoldingNet | 88.4% | 94.4% |
| CloudContext (ours) | 89.3% | 94.5% |

Table 1: Comparison of unsupervised methods in downstream object classification on the ModelNet40 and ModelNet10 datasets in terms of accuracy. All methods use a linear SVM trained on the representations learned in an unsupervised fashion on the ShapeNet dataset.

Figure 3: Visualization of the object embeddings of the ModelNet10 test data obtained through training with context prediction on the ShapeNet dataset. t-SNE with perplexity 10 and 1000 iterations was used for dimensionality reduction.

Figure 4: A plot of the unsupervised training loss on the ShapeNet dataset and the linear SVM accuracy trained on obtained embeddings for the ModelNet dataset. Performing better on the unsupervised tasks results in stronger embeddings for downstream object classification.

In the unsupervised setting our method outperforms all previous approaches on ModelNet40, and all except Latent-GAN on ModelNet10. However, as noted by (Yang et al., 2018), the point cloud format and sampling procedure from Latent-GAN is not publicly available, making a comparison on this accuracy inconclusive. The obtained embeddings for the ModelNet10 test data are visualized using t-SNE (Maaten & Hinton, 2008) in Figure 3. One can see that clear clusters are formed for each class which are separable except for the classes dresser (violet) vs nightstand (pink), which are visually similar when scaled to the unit cube, as done in the ShapeNet dataset. Figure 4 highlights the effectiveness of context prediction: a decrease in unsupervised training loss gives a better downstream classification accuracy, which suggests that our method is able to capture more relevant semantics with more training, enabling better linear separability.

A main motivation for unsupervised training is that in a large dataset, a very small number of labelled samples can suffice to achieve strong performance in machine learning tasks. We evaluate context prediction in such a scenario by limiting the number of training samples available for SVM training in the ModelNet object classification task. We sample according to the following procedure: first we randomly sample one object per class, and then sample the remaining objects uniformly out of the entire training set. The results are shown in Figure 5. Using only 1 % of the training data, equivalent to roughly three samples per class, our model achieves 61.1 % accuracy on the test set. When using 10 % of the available training samples, this accuracy rises to 82.6 %.
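The two-stage sampling procedure can be sketched as below. The function name `sample_few_labels` is hypothetical, and drawing the second stage without replacement from the objects not yet chosen is an assumption about a detail the text leaves open.

```python
import random


def sample_few_labels(labels, fraction, rng):
    """Pick training indices: one random object per class, then a uniform fill-up."""
    budget = max(round(fraction * len(labels)), len(set(labels)))
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Stage 1: guarantee that every class is represented once.
    chosen = [rng.choice(indices) for indices in by_class.values()]
    # Stage 2: fill the remaining budget uniformly (assumed: without replacement).
    chosen_set = set(chosen)
    remaining = [i for i in range(len(labels)) if i not in chosen_set]
    chosen += rng.sample(remaining, budget - len(chosen))
    return chosen
```

With the 9843 ModelNet40 training objects and `fraction=0.01`, this selects 98 objects, 40 of which come from the first stage.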

Figure 5: Linear SVM classification accuracy on ModelNet40 when only few annotated training samples are available.

In the second experiment we compare our approach to fully supervised approaches. All models, including our unsupervised approach, are trained on the ModelNet40 train dataset. We train a kernel SVM on the embeddings obtained with our approach. The results are shown in Table 2. With 90.8 % accuracy, our unsupervised approach outperforms the fully supervised PointNet (89.2 %) and marginally outperforms PointNet++ (90.7 %), while remaining below DGCNN and PointCNN (both 92.2 %).

| Model | MN40 |
| --- | --- |
| PointNet | 89.2% |
| PointNet++ | 90.7% |
| DGCNN | 92.2% |
| PointCNN | 92.2% |
| CloudContext (ours) | 90.8% |

Table 2: Comparison of our method to leading fully supervised methods. All methods were trained only on the ModelNet dataset; an SVM is trained on the embeddings obtained through context prediction.

| | Mean | Aero | Bag | Cap | Car | Chair | Earphone | Guitar | Knife | Lamp | Laptop | Motor | Mug | Pistol | Rocket | Skateboard | Table |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| # Shapes | | 2690 | 76 | 55 | 898 | 3758 | 69 | 787 | 392 | 1547 | 451 | 202 | 184 | 283 | 66 | 152 | 5271 |
| PointNet | 83.7 | 83.4 | 78.7 | 82.5 | 74.9 | 89.6 | 73.0 | 91.5 | 85.9 | 80.8 | 95.3 | 65.2 | 93.0 | 81.2 | 57.9 | 72.8 | 80.6 |
| PointNet++ | 85.1 | 82.4 | 79.0 | 87.7 | 77.3 | 90.8 | 71.8 | 91.0 | 85.9 | 83.7 | 95.3 | 71.6 | 94.1 | 81.3 | 58.7 | 76.4 | 82.6 |
| DGCNN | 85.1 | 84.2 | 83.7 | 84.4 | 77.1 | 90.9 | 78.5 | 91.5 | 87.3 | 82.9 | 96.0 | 67.8 | 93.3 | 82.6 | 59.7 | 75.5 | 82.0 |
| Ours + 3 Layer MLP | 81.5 | 79.1 | 80.8 | 85.8 | 74.1 | 87.7 | 73.6 | 88.4 | 83.7 | 75.6 | 95.4 | 60.0 | 93.2 | 76.3 | 54.4 | 77.9 | 80.2 |

Table 3: Comparison of a simple MLP trained on the frozen embeddings obtained through unsupervised context prediction to fully supervised methods. Metric is mIoU (%) on points.

4.2 Part Segmentation

In this section we explore the per-point embeddings obtained through unsupervised training in a part segmentation task. Again, we train our model in an unsupervised fashion on the ShapeNet dataset. The task is then to correctly classify points as belonging to an object part on the ShapeNet Part dataset (Yi et al., 2016), which is a subset of the full ShapeNet containing 16881 3D objects from 16 categories, annotated with 50 parts in total. We use the official train / validation / test splits. Each point is represented as the concatenation of the frozen point embedding with the point’s corresponding segment embedding obtained through unsupervised training. Following the same procedure as in (Qi et al., 2017a, b; Wang et al., 2018), we also concatenate the class label of the object. We then train a fully connected MLP with three hidden layers of size 1024, using ReLU activations, batch normalization and 50 % dropout on each hidden layer, on these embeddings. Part segmentation is evaluated on the mean Intersection-over-Union (mIoU) metric, calculated by averaging IoUs for each part in an object before averaging the obtained values for each object class. The results are shown in Table 3. Our approach shows competitive performance with fully supervised methods. CloudContext outperforms PointNet in six object classes and can outperform every evaluated method in two classes.
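The two-level averaging behind the mIoU metric can be sketched as follows. Counting a part that is absent from both prediction and ground truth as an IoU of 1 is an assumption borrowed from common evaluation practice, and the function names are illustrative.

```python
def shape_iou(pred, truth, part_ids):
    """Mean IoU over the parts of one object, given per-point part labels."""
    ious = []
    for part in part_ids:
        inter = sum(1 for p, t in zip(pred, truth) if p == part and t == part)
        union = sum(1 for p, t in zip(pred, truth) if p == part or t == part)
        # Assumed convention: a part missing from both counts as a perfect match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def class_miou(shapes):
    """shapes: iterable of (category, pred, truth, part_ids); returns {category: mIoU}."""
    per_class = {}
    for category, pred, truth, part_ids in shapes:
        per_class.setdefault(category, []).append(shape_iou(pred, truth, part_ids))
    return {c: sum(v) / len(v) for c, v in per_class.items()}
```

For an object with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, part 0 has an IoU of 1/2 and part 1 an IoU of 2/3, giving a per-object value of 7/12.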

4.3 Semantic Segmentation

The semantic segmentation task is evaluated on the Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset (Armeni et al., 2017). The dataset consists of 3D point cloud scans from 6 indoor areas totalling 272 rooms. The points are classified into 13 semantic classes such as board, chair, ceiling, beam, and clutter. For fair comparability, we use the exact same procedure as in (Qi et al., 2017a; Wang et al., 2018), in which each room is split into blocks of fixed area and each point is given as a 9D vector containing XYZ coordinates, RGB color values, and normalized coordinates indicating the position in the room. A 6-fold cross validation is performed in which each of the areas is used as the test set once. We train in an unsupervised manner on the training split of each fold, and use the same setup for classification on the obtained embeddings as in the part segmentation task. The results are summarized in Table 4. Despite the embeddings being learned in an unsupervised manner, CloudContext is able to slightly outperform PointNet in classification accuracy and performs on par with PointNet in terms of mIoU. This demonstrates that our method is applicable to point clouds that capture scenes that go beyond simple free-standing objects such as those included in the ShapeNet or ModelNet datasets.
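The 9D point representation and the leave-one-area-out folds can be sketched as below. Normalising the position to the range [0, 1] over the bounding box of the room is an assumption about how the normalized coordinates are computed, and the area names are placeholders.

```python
def nine_d_features(points, colors):
    """Build 9-D vectors: XYZ, RGB, and position normalised within the room.

    points: (x, y, z) tuples of ONE room; colors: matching (r, g, b) tuples.
    Assumed: normalisation to [0, 1] over the room's bounding box.
    """
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    spans = [(maxs[i] - mins[i]) or 1.0 for i in range(3)]  # avoid division by zero
    return [
        tuple(p) + tuple(c) + tuple((p[i] - mins[i]) / spans[i] for i in range(3))
        for p, c in zip(points, colors)
    ]


# Six folds: every area serves as the test set exactly once.
AREAS = ["area_%d" % i for i in range(1, 7)]  # placeholder names
FOLDS = [(test, [a for a in AREAS if a != test]) for test in AREAS]
```

Each fold pairs one held-out area with the five remaining areas, which are then used both for unsupervised training and for fitting the classifier.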

| Model | Mean IoU | Total Accuracy |
| --- | --- | --- |
| PointNet | 47.6% | 78.5% |
| DGCNN | 49.7% | 84.1% |
| CloudContext (ours) | 47.6% | 78.9% |

Table 4: Results of semantic segmentation on the S3DIS dataset.

5 Discussion

While learning to predict spatial relations of point cloud segments in an unsupervised manner, a neural network learns object-level and per-point semantic representations of objects. For example, to successfully predict that a part of a lampshade is above part of the stand, our model needs to learn a representation of which point cloud segments resemble which parts of objects. These embeddings can then be used to classify objects, e.g. “a lamp has a lampshade and a stand”. We found that including the relation “A is not related to B”, which we realize by selecting a random point cloud segment from the entire training dataset, improves accuracy in the ModelNet40 object classification task. This is unsurprising, as learning e.g. that an airplane wing is not from the same object as the leg of a table can encode information in the embeddings that is useful for discriminating between object classes. We also found that using the concatenation of all segment embeddings as an object embedding results in higher accuracy than using information about the relations between an object’s segments (i.e. by using any of the dense layers).

The specific neural network architecture used for context prediction is flexible. We found that using our simplified version of DGCNN outperforms using a PointNet as described by (Qi et al., 2017a) in the ModelNet40 object classification task. As this difference resembles the difference in supervised accuracy reached by DGCNN over PointNet (92.2 % vs. 89.2 %), we expect context prediction to further improve when new advancements in neural network architectures for point clouds are made. For comparability to previous methods, we always use the same data input format as in (Qi et al., 2017a; Wang et al., 2018), in which point clouds come in pre-processed chunks designed to be effective for fully supervised learning. Context prediction could instead benefit from learning in a contiguous point cloud setting, allowing it to learn from relations with varying granularity. One of the main motivations for unsupervised learning in point clouds is reducing the number of labelled samples required for strong performance in downstream tasks. Our method succeeds in this aspect.

Doersch et al. (2015) observe that when using context prediction for images, the neural network quickly finds trivial solutions in low-level information such as similar textures or chromatic aberration. However, in none of our experiments did the neural network learn to identify the relations correctly without also producing useful embeddings for downstream tasks. Our results suggest that trivial solutions are not detrimental to unsupervised learning in a point cloud setting.

6 Conclusion

In this paper we propose an unsupervised method for learning representations in point clouds from context. In this method, a neural network learns to predict the spatial relationship between two point cloud segments. While learning to solve this task, a neural network learns representations encoding the semantics of the point cloud. We propose a generalizable setup, in which a concrete neural network architecture can be flexibly used as a drop-in component. We use a simplified DGCNN within our setup and expect to see improvements with context prediction in the future as advancements in deep learning architectures for point clouds are made. We demonstrate that the embeddings obtained through unsupervised training can lead to strong performance in downstream tasks: in an object classification task, our method outperforms state-of-the-art unsupervised methods. We also explore unsupervised per-point embeddings in point segmentation tasks: training a simple classifier on the obtained unsupervised embeddings leads to competitive performance with fully supervised methods. Using our unsupervised method in downstream tasks can result in strong performance even when the number of labelled samples is extremely small. This is useful as point clouds in practice such as 3D LiDAR scans contain massive amounts of data, but are cumbersome to label. In the scope of this paper we focus exclusively on simple spatial relations. An exhaustive analysis of possible relations and experimenting with input data formats other than pre-processed chunks of fixed sizes that could improve learning is left as future work.