Picasso
We present Picasso, a CUDA-based library comprising novel modules for deep learning over complex real-world 3D meshes. Hierarchical neural architectures have proved effective in multi-scale feature extraction, which signifies the need for fast mesh decimation. However, existing methods rely on CPU-based implementations to obtain multi-resolution meshes. We design GPU-accelerated mesh decimation to facilitate network resolution reduction efficiently on-the-fly. Pooling and unpooling modules are defined on the vertex clusters gathered during decimation. For feature learning over meshes, Picasso contains three types of novel convolutions, namely facet2vertex, vertex2facet, and facet2facet convolution. Hence, it treats a mesh as a geometric structure comprising vertices and facets, rather than a spatial graph with edges as previous methods do. Picasso also incorporates a fuzzy mechanism in its filters for robustness to mesh sampling (vertex density). It exploits Gaussian mixtures to define fuzzy coefficients for the facet2vertex convolution, and barycentric interpolation to define the coefficients for the remaining two convolutions. In this release, we demonstrate the effectiveness of the proposed modules with competitive segmentation results on S3DIS. The library will be made public through https://github.com/hleiziyan/Picasso.
Data in computer vision commonly vary from homogeneous formats in 2D projective space (e.g., images, videos) to heterogeneous formats in 3D Euclidean space (e.g., point clouds, meshes). The success of convolutional feature learning on homogeneous data
[21, 28, 37, 38, 46, 47, 50, 55] has sparked research interest in geometric deep learning [6, 7, 14, 26, 54], which aims for equally effective feature learning on heterogeneous data. Due to the rise of autonomous driving and robotics, 3D deep learning has now become an important branch of this geometric research direction. Compared to 3D point clouds, 3D meshes convey richer geometric information about object surface and topology. Yet, the heterogeneous facet shapes and sizes, combined with unstructured vertex locations, make their adaptation to deep learning more difficult than for point clouds. This is why most approaches address real-world 3D scene understanding via convolutions on point clouds
[31, 32, 33, 35, 43, 44]. However, point clouds still fall short in preserving the structural details that are easily represented by meshes. There are a few works that learn features from meshes, but they are largely constrained to shape analysis on small synthetic models [5, 20, 39, 41, 45, 68]. These methods either apply convolutions throughout the network at a single mesh resolution (the input), or exploit inefficient CPU-based algorithms to decimate the mesh [16, 17, 51, 71]. However, non-hierarchical network configurations and slow network coarsening are both problematic when dealing with real-world meshes because of their large-scale nature. This calls for mesh simplification methods that are fast and amenable to deep learning for real-world applications.
[Figure 1: (a) the input mesh, and its decimations by (b) VC, (c) QEM, and (d) ours, with runtimes in ms.]
We present a GPU-accelerated mesh simplification algorithm to facilitate the exploration of hierarchical architectures on meshes. The proposed method is not only fast in decimating small-scale watertight meshes from CAD modelling [4, 9, 11, 36], but is also efficient in simplifying large-scale unstructured real-world meshes [2, 8, 12, 22]. We perform all computations in parallel on the GPU, except for the grouping of vertex pairs to be contracted. Meanwhile, to increase its compatibility with modern deep learning modules such as normalization [3, 24, 60, 66], we contract vertex clusters by controlling the desired vertex size of the decimated mesh. This also allows mesh-based modules to be exploited in conjunction with various point cloud based modules [52]. Our algorithm is able to reduce the number of mesh vertices by half in each iteration. Figure 1 compares the runtime of our method with two well-established decimation methods, VC [51] and QEM [17]. Notice that our method is
considerably faster than QEM. During simplification, we record all vertex clustering information in a 1D tensor. Based on this tensor, we also define max, average and weighted pooling, as well as unpooling. Earlier attempts at convolution on meshes
[5, 39, 41] explored local patch operators in handcrafted coordinate systems. The development of spatial graph convolutions has led recent methods [20, 45, 52] to predominantly consider a (triangular) mesh as a special graph and convolve the features of each vertex over its geodesic ring neighborhood. In contrast, we study a mesh as a set of vertices and facets, following its natural geometric structure. To learn the features of each vertex, we aggregate context information from its adjacent facets. We refer to the resulting operation as facet2vertex convolution. Lei et al. [32] showed that a fuzzy mechanism makes a convolutional network robust to point density variation. Hence, we further exploit fuzzy coefficients in the facet2vertex convolution. Because facet normals are distributed strictly on the surface of the unit sphere, we associate learnable filter weights with Gaussian mixtures defined on that sphere. The parameters of the Gaussian components can optionally be kept fixed or trained within the network in our library. On the other hand, to learn features of the mesh facets, we introduce vertex2facet and facet2facet convolutions. The former propagates features from the vertices of a facet to the facet itself, while the latter is applied when facets of the input mesh are rendered with textures. We incorporate fuzziness into these two convolutions using barycentric interpolation. Altogether, the three proposed convolutions enable flexible vertex-wise and facet-wise feature learning on the mesh. We provide CUDA implementations for all the above-mentioned modules and organize them into a self-contained library named Picasso (paying homage to Pablo Picasso for cubism in paintings) to facilitate deep learning over unstructured real-world 3D meshes. We note that meshes and point clouds are tightly coupled, and it is more desirable to extract features from the two data modalities cooperatively rather than individually or competitively.
DCM-Net [52] also validates this argument. For this reason, we additionally incorporate all the point cloud modules from Lei et al. [32, 33] in our library (with author permission). In this maiden release of our library, we demonstrate the promise of the proposed modules with competitive segmentation results on S3DIS [2]. The segmentation network is referred to as PicassoNet. We summarize the main contributions of our work below:
- We present a fast GPU-accelerated 3D mesh decimation technique to reduce mesh resolution on-the-fly. A public implementation with complementary CUDA-based pooling and unpooling operations is provided.
- We propose three novel convolution modules, i.e., facet2vertex, vertex2facet, and facet2facet, to alternately learn vertex-wise and facet-wise features on a mesh. Diverging from existing methods, we do not rely on the restrictive treatment of a mesh as an undirected graph with edges. Instead, our modules treat it as a geometric structure composed of vertices and facets, which is also a more natural representation for digital devices.
- With this paper, we release Picasso, a self-contained library for deep learning over unstructured real-world 3D meshes, as well as synthetic watertight meshes. The provided anonymous GitHub link will be made public for the broader research community.
Convolutions on point clouds: 3D-CNNs [15, 18, 40, 49, 67] are the most intuitive solution for transferring CNNs from images to point clouds. A few methods also explore similar regular-grid kernels in a transformed data domain [56, 57]
. Permutation-invariant networks exploit multi-layer perceptrons (MLPs) and max-pooling to learn features from point clouds
[27, 34, 43, 44, 48, 53, 65]. They demonstrate the effectiveness of taking point coordinates as input features. Graph-based networks allow convolutions to be performed in either the spectral or the spatial domain. However, the mandatory alignment of different graph Laplacians makes the application of spectral graph convolutions to point clouds more difficult than that of spatial graph convolutions [70]. As a pioneering work, ECC [54] exploited dynamic filters [13] to analyze point clouds with spatial graph convolutions. Subsequent works explored more effective filter or kernel parameterizations [19, 35, 62, 65, 69]. The recent discrete kernels [31, 32, 33, 59] are appealing alternatives to those dynamic methods as they avoid the dependence on filter generation within the network. KPConv [59] reported more competitive results, while the spherical kernels [32, 33] are more memory and runtime efficient. Convolution on meshes: In the nascency of this direction, researchers generally performed convolutions over local patches defined in a handcrafted coordinate system [5, 39, 41]. The coordinate system could be established either by geodesic level sets [39] or by surface normals and principal curvatures [5, 41]. In FeaStNet [61], Verma et al. capitalized on a learnable mapping between filter weights and graph neighborhoods to replace those handcrafted local patches. TextureNet [23] takes surface meshes with high-resolution facet textures as input, and explores a 4-RoSy field to parameterize the mesh into local planar patches such that standard CNNs [28] are applicable. Ranjan et al. [14] proposed to learn human facial expressions with hierarchical mesh-based networks, using spectral graph convolutions. Schult et al. [52] proposed to extract features from both meshes and point clouds simultaneously. Similar to [35, 54, 64, 65], they still conduct convolution using dynamically generated filters.
Whereas most methods learn vertex-wise features, MeshCNN [20] defines a convolution to learn edge-wise features on a mesh. Generally, previous methods treat a mesh as an edge-based graph and employ geodesic convolutions over it. In this paper, we explore convolutions on the mesh following its own geometric structure, i.e., vertices and facets. To promote this more natural perspective, we also provide computation- and memory-optimized CUDA implementations for the forward and backward propagations of all the convolutions we propose. Mesh decimation: Hierarchical networks allow convolutions to be applied on increasing receptive fields of the input data. To create such hierarchical architectures on point clouds, researchers usually exploit random point sampling or farthest point sampling (FPS) [33, 44, 65]. However, because of their inability to track vertex connections, these samplings are not applicable to mesh processing. Fortunately, mesh simplification is a well-studied topic in the graphics community, and methods are available that can be used to decimate a mesh [16, 17, 51]. For example, Ranjan et al. [45] used the quadric error metrics [17] to simplify their synthetic facial meshes. Schult et al. [52] explored vertex clustering [51] and quadric error metrics [16, 17] to simplify their indoor room meshes. In particular, they made use of the simplification functions provided by Open3D [71], a popular library for 3D geometry processing in Python. However, the implementations in Open3D are CPU-based, which makes them unamenable to deep learning. In this work, we introduce a fast mesh decimation method based on the algorithm of Garland et al. [16, 17]. Their method simplifies a mesh through iterative contractions of vertex pairs, and demands that the quadric errors of the vertex pairs be re-evaluated after each iteration. This progressive strategy makes the algorithm unsuitable for parallel deployment on GPUs.
In comparison, we sort all the vertex pairs by their quadric errors only once, and group them into isolated vertex clusters. All other computations in our method are mutually independent and can be accelerated via parallel GPU computing. We represent the vertex cluster information as a 1D tensor to facilitate the pooling and unpooling operations. To explore flexible deep neural networks for meshes, there has been an increasing demand in the 3D community for a fast mesh decimation method. The quadric error simplification
[17] performs iterative contractions of vertex pairs until the desired number of facets is obtained. This method is known to be effective for simplifying meshes while retaining their quality. However, it is not suited to parallel computing due to the implicit dependencies between its iterative contractions. Considering its high relevance, we reproduce the quadric error method as Algorithm 1; its iterative dependency is clear from lines 7–13. We extend [17] and propose a fast mesh decimation method that exploits the parallel computing power of a GPU. The main difference is that we no longer perform iterative contractions. Instead, we group the vertices into multiple isolated clusters based on their connections. Figure 2 provides a toy example to illustrate the clustering process. During clustering, we control the expected vertex number rather than the number of facets or edges. Since the contractions of isolated clusters are independent of each other, they can be executed on a GPU in parallel. More specifically, we initialize the candidate vertex pairs to be contracted using existing mesh edges only. The vertex clusters are established from disjoint vertex pairs among the candidates, before which we sort the candidates in ascending order of their quadric cost (a shuffling strategy is inserted into the sorting of quadrics to vary the decimated mesh across epochs)
. See Algorithm 2 for our method. We note that our core algorithm reduces the vertex number roughly by half per iteration. Yet, we can still handle an arbitrary number of output vertices by running the core algorithm multiple times. For clarity, we present Algorithm 2 with a single mesh taken as the input. In our library, the decimation function can take mini-batches of concatenated meshes as input. This improves its compatibility with deep learning, e.g., batch normalization
[24] is easily applicable. The vertex clustering operations (lines 3–11 in Algorithm 2) are currently executed on the CPU, which helps in deploying the heavy computations to the GPU. We note that consistency checks and penalties are excluded in our method in favor of better runtime efficiency. For continuity, we delegate the details of the complementary output parameters, replace and mapping, to the source code itself. Pooling and unpooling: We record the vertex clustering information as well as the degenerated vertex information in two different 1D tensors, whose sizes are both identical to the input vertex number. They greatly facilitate the pooling and unpooling operations. Given the vertex clusters as neighborhoods, different pooling operations can be defined, such as sum, average, max, and weighted pooling; we provide average, max and weighted pooling in the library. For unpooling, we simply interpolate the features of all vertices in a cluster based on the features of the representative vertex of the cluster.
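To make the decimation-and-pooling pipeline concrete, the following NumPy sketch groups candidate vertex pairs into disjoint clusters in one pass over the cost-sorted candidates, then uses the resulting 1D cluster tensor for average pooling and unpooling. All function and variable names are our illustrative assumptions; the library's CUDA kernels implement these operations differently.

```python
import numpy as np

def group_disjoint_pairs(edges, costs, num_vertices):
    """One-shot vertex-pair grouping: sort candidates by quadric cost once,
    then accept a pair only if neither endpoint is taken, so that all
    accepted contractions are mutually independent (hence parallelizable)."""
    cluster = -np.ones(num_vertices, dtype=np.int64)
    next_id = 0
    for e in np.argsort(costs):                # ascending quadric cost
        u, v = edges[e]
        if cluster[u] < 0 and cluster[v] < 0:
            cluster[u] = cluster[v] = next_id
            next_id += 1
    for u in range(num_vertices):              # unmatched vertices: singletons
        if cluster[u] < 0:
            cluster[u] = next_id
            next_id += 1
    return cluster, next_id                    # 1D cluster tensor, cluster count

def average_pool(features, cluster, num_clusters):
    """Average-pool vertex features into cluster features via the 1D tensor."""
    pooled = np.zeros((num_clusters, features.shape[1]))
    counts = np.zeros(num_clusters)
    np.add.at(pooled, cluster, features)
    np.add.at(counts, cluster, 1.0)
    return pooled / counts[:, None]

def unpool(pooled, cluster):
    """Unpooling: copy each cluster feature back to its member vertices."""
    return pooled[cluster]
```

When most candidate pairs are disjoint, roughly half of the vertices disappear per pass, matching the halving behavior described above.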
Let $\mathcal{M}=(\mathcal{V},\mathcal{F})$ be a triangular mesh with vertices $\mathcal{V}$ and facets $\mathcal{F}$. The input features of its vertices are $\{f_v\}$, while the normal and area of each facet are $n$ and $a$, respectively. If the mesh is rendered, we denote the textures of a facet as $T \in \mathbb{R}^{M \times C}$, in which $M$ represents the texture resolution of the facet and $C$ relates to the color channels. Since the facets usually have varying areas, we allow $M$ to vary for different facets as well.
Vertex2facet convolution: We compute the features of each facet based on the features of its vertices. In particular, we define a kernel composed of 3 filters associated with the three vertices of the triangular facet. Barycentric interpolation is used to incorporate a fuzzy scheme into the convolution. We determine the total number of interpolated points as
$M = \Big\lceil \dfrac{a - a_{\min}}{a_{\max} - a_{\min}} \, (M_{\max} - M_{\min}) \Big\rceil + M_{\min}$,  (1)
in which $a_{\min}, a_{\max}$ are the minimum and maximum facet areas of the mesh, while $M_{\min}, M_{\max}$ are hyperparameters fixed in our experiments. Let $f_1, f_2, f_3$ be the vertex features, and $w_1, w_2, w_3$ be the filter weights. The barycentric coordinates $(b_1, b_2, b_3)$ satisfy $b_1 + b_2 + b_3 = 1$. We use them to interpolate $M$ interior points uniformly on the facet, whose features and filter weights are computed respectively as
$f_m = \sum_{i=1}^{3} b_{mi}\, f_i$,  (2)
$w_m = \sum_{i=1}^{3} b_{mi}\, w_i$.  (3)
Here, $m = 1, \dots, M$ indexes the interpolated points, whose total number $M$ is defined based on the facet area. With $\{f_m\}$ and $\{w_m\}$, we finally compute the feature of each facet as
$y = \dfrac{1}{M} \sum_{m=1}^{M} w_m \odot f_m$,  (4)
where $\odot$ denotes the channel-wise product.
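A minimal NumPy sketch of the channel-wise computation in Eqs. (2)–(4); the names and shapes are our illustrative assumptions, not the library interface:

```python
import numpy as np

def vertex2facet(vertex_feats, filters, bary):
    """Channel-wise vertex2facet sketch.

    vertex_feats: (3, C) features of the facet's three vertices
    filters:      (3, C) the 3-filter kernel, one filter per facet vertex
    bary:         (M, 3) barycentric coordinates of M interior sample points
    """
    f = bary @ vertex_feats       # interpolated point features, Eq. (2)
    w = bary @ filters            # interpolated filter weights, Eq. (3)
    return (w * f).mean(axis=0)   # facet feature, averaged over samples, Eq. (4)
```

With identity filters the operation reduces to averaging the interpolated vertex features, which makes the barycentric weighting easy to verify by hand.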
Facet2facet convolution: The facet2facet convolution is only applicable when the input mesh is rendered with textures. Each facet contains $M$ interior points whose texture features are already available. The definition of the facet2facet convolution is quite similar to that of the vertex2facet convolution. The main difference is that, in the facet2facet convolution, there is no need to interpolate features for the interior points. We only need to interpolate the filter weights $\{w_m\}$, and then compute the facet features following Eq. (4).
[Figure 3: distribution of facet normals on the unit sphere for an S3DIS mesh, viewed from the (a) front, (b) back, (c) top, and (d) bottom.]
Facet2vertex convolution: We compute vertex features from the features of their adjacent facets. Considering that the normal of each facet is strictly located on the surface of the unit sphere, we define the filter weights of our kernel by associating them with different positions on the sphere. Besides, we observe that the normals of real-world meshes generally distribute in distinctive patterns on the unit sphere, especially for indoor meshes. This is related to the construction preferences of human beings. We show an example in Fig. 3, with data taken from S3DIS [2]. There are six main clustering patterns, corresponding to normal directions roughly along $(\pm 1, 0, 0)$, $(0, \pm 1, 0)$, $(0, 0, \pm 1)$. Consequently, we exploit a mixture of Gaussian components to divide the sphere surface and implicitly cluster the normals. Let the total number of Gaussian components be $T$, and their expectations and covariance matrices be $\{\mu_t\}$ and $\{\Sigma_t\}$. Using its normal $n$, we compute the fuzzy coefficients of a facet as
$\tilde{c}_t = \exp\!\big(-\tfrac{1}{2}(n - \mu_t)^{\top} \Sigma_t^{-1} (n - \mu_t)\big)$,  (5)
$c_t = \dfrac{\tilde{c}_t}{\sum_{j=1}^{T} \tilde{c}_j}$.  (6)
For simplicity, we use a homogeneous diagonal matrix for each $\Sigma_t$ in our experiments. Let the filter weights in the kernel be $\{w_t\}_{t=1}^{T}$, the adjacent facets of vertex $v$ be $\mathcal{N}(v)$, and the facet features be $\{g_k\}$. The vertex feature is computed as
$y_v = \dfrac{1}{|\mathcal{N}(v)|} \sum_{k \in \mathcal{N}(v)} \sum_{t=1}^{T} c_t(n_k)\, w_t \odot g_k$,  (7)
where $n_k$ is the normal of facet $k$.
In our library, we allow the expectations and covariance matrices of the Gaussians to be constant, or learnable together with the filter weights during training. In our experiments, we initialize the Gaussian means as regularly distributed points on the sphere and keep them constant, while allowing the covariance matrices to be learnable parameters. We note that the facet2vertex convolution is scale and translation invariant but not rotation invariant, because its fuzzy coefficients are computed from facet normals.
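The fuzzy facet2vertex aggregation can be sketched as follows. This is a simplified NumPy illustration assuming a scalar shared variance (per the homogeneous diagonal covariance above); the function names are hypothetical, not the library's API:

```python
import numpy as np

def fuzzy_coeffs(normal, mu, inv_var):
    """Normalized Gaussian-mixture coefficients of a facet normal (Eqs. 5-6).
    mu: (T, 3) component means on the unit sphere; inv_var: shared 1/sigma^2."""
    d2 = ((normal[None, :] - mu) ** 2).sum(axis=1)
    w = np.exp(-0.5 * inv_var * d2)
    return w / w.sum()

def facet2vertex(facet_feats, normals, mu, inv_var, filters):
    """Aggregate a vertex feature from its adjacent facets: each facet's
    contribution blends the T filters by its fuzzy coefficients (Eq. 7)."""
    out = np.zeros(filters.shape[1])
    for g, n in zip(facet_feats, normals):
        c = fuzzy_coeffs(n, mu, inv_var)               # (T,) coefficients
        out += (c[:, None] * filters).sum(axis=0) * g  # channel-wise product
    return out / len(facet_feats)
```

With a narrow variance, a facet whose normal sits at one Gaussian mean effectively selects that component's filter, which mirrors the implicit normal clustering described above.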
For all the proposed convolutions, we split the operation into channel-wise and depth-wise operations. Doing so is known to improve the computational efficiency of the operation [10, 33]. Our mesh decimation technique forces the vertex number, rather than the facet number, of each decimated mesh in a batch to be the same. Therefore, we apply batch normalization to vertex features for stable computation of the batch statistics. Combining the vertex2facet and facet2vertex convolutions, we can perform a vertex2vertex convolution. Figure 4 illustrates the concepts of the vertex2facet and facet2vertex convolutions, along with a flow chart to elucidate the operations in the vertex2vertex convolution.
Receptive field: All the convolutions we propose follow the natural organization of vertex lists in the mesh facets. Therefore, the operations accumulate context from the first-order neighborhood. This is similar to the receptive field of a single small-kernel convolution in standard CNNs for image processing. To benefit feature learning from larger context, we apply multiple cascaded vertex2vertex convolutions in our network, which leads to a deeper neural network. Disconnected components:
Realworld meshes are not guaranteed to form a connected graph. It is probable that a mesh is composed of multiple disconnected components. This breaks the context aggregation between different components for mesh convolutions. On the other hand,
forcing the mesh to be connected can damage its geometric structure. In this work, we counter the potential issues of mesh connectivity using point cloud convolution, e.g., the fuzzy SPH3D [32]. By using range search [42], point cloud convolutions permit arbitrary receptive fields. This strategy enables our feature learning to go beyond single connected components and learn features across multiple components. The PicassoNet: To explore the proposed modules, we design a U-Net-like [50] architecture for the popular semantic segmentation task, which we name PicassoNet. The proposed network is composed of 5 mesh resolutions, indicating 5 hierarchical layers. We use convolution blocks to learn features within a single network resolution. Figure 5 shows the different block configurations considered in this work. The vanilla block uses only the mesh convolutions, while the efficient block uses mesh convolutions accompanied by a point-based convolution in parallel, before feature concatenation. PicassoNet based on the efficient block is able to process 1 million facets per second at inference time, while still reporting competitive results. The generic block uses an arbitrary number of mesh-based convolutions with one point-based convolution to extract features before the concatenation. Unlike DCM-Net [52], we do not tune the ratio of feature channels between mesh-based and point-based convolutions. We use identical feature channels for both convolutions, which corresponds to a fixed ratio of 0.5. We use max mesh pooling to obtain downsampled features for the low-resolution layers. For efficiency, we apply the convolution blocks only in the encoder, whereas in the decoder we use convolutions and mesh unpooling to upsample the features, somewhat similar to the methods in [32, 59]. PicassoNet takes 6-dimensional input features, similar to many point cloud networks [32, 33, 59].
Although normals are readily available for both facets and vertices of meshes, we do not observe a noticeable performance gain from using vertex normals as input features in our experiments. To investigate the modules presented in the Picasso library, we focus on semantic segmentation of the real-world dataset S3DIS [2] using PicassoNet. We choose the S3DIS dataset due to the public availability of its training and testing sets. The color values are rescaled before being fed into the network. Our network is trained on a single GeForce RTX 2080 Ti GPU with the Adam optimizer [25]
. For training, we adopt an initial learning rate of 0.001 with exponential decay; specifically, we decay the learning rate by a factor of 0.7 every 20K batch updates. Throughout the experiments, we use 18 Gaussian components, whose centers are located regularly on the unit sphere, for all facet2vertex convolutions. See the supplementary material for their specific values. The standard deviations of these Gaussians are universally initialized to 0.25, which is determined by the nearest center distances between different Gaussians. We switch off training of the Gaussian expectations on purpose because we observe that their values change very little in our experiments. Figure
6 provides an example showing the updates of the standard deviations of the Gaussians during training. We use the fuzzy spherical kernel proposed in [32] to realize the point-based convolution in our efficient and generic blocks. The kernel size and neighborhood search configurations are identical to [32]. Our network is trained with a fixed batch size and takes fixed-size point clouds as inputs. We exploit widely used data augmentations in our experiments, including random flipping, shifting, random scaling, noisy translation, random azimuth rotation and arbitrary rotation perturbations. We also apply random color dropping to augment the vertex textures. The shuffling strategy we inserted into the mesh simplification also plays a role of data augmentation. We apply these augmentations on-the-fly during network training. Network configuration: We use identical feature configurations for the different convolution blocks shown in Fig. 5. Specifically, we use a decreasing sequence of desired vertex sizes for the five mesh resolutions, and an increasing range-search radius to construct the neighborhoods for the point-based convolution. The output feature sizes of the 5 hierarchical layers are 128, 128, 256, 256, 256, respectively. When mesh-based and point-based convolutions are both included, each computes features of half the size of the defined output feature channels. For example, when the output feature channels are set to 128, each computes features of 64 channels, so that their concatenated features have 128 channels. We set the multiplier of all convolutions to 1 except for the convolution performed on the inputs. In the decoder, we exploit mesh unpooling for feature interpolation. The output feature channels of the
convolutions are the same as their corresponding encoder features. PicassoNet also classifies the features obtained at resolution
in the decoder directly for efficiency. The Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset [2] is a real-world dataset composed of dense 3D point clouds but sparse 3D meshes of 6 large-scale indoor areas. The data is collected with a Matterport scanner from three different buildings on the Stanford campus. Semantic segmentation on this dataset classifies 13 defined classes: ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board, and clutter. We follow the standard training/test split, using Area 5 as the test set and the other areas as the training set [30, 35, 43, 58, 63]
. The evaluation metrics for this dataset comprise the Overall Accuracy (OA), the average Accuracy over all 13 classes (mAcc), the class-wise Intersection over Union (IoU), and their average (mIoU). mIoU is commonly considered a more reliable metric than the others. DCM-Net
[52] used over-tessellation and interpolation to produce high-resolution meshes with labels from the original meshes in S3DIS. In contrast, we triangulate the labelled point cloud into triangular meshes using the algorithm from [1]. Our network takes meshes cropped from the room mesh as input data. The experimental results for different block configurations of PicassoNet are provided in Table 1. It can be noticed that PicassoNet provides competitive results to KPConv [59], and slightly outperforms SegGCN [32] and DCM-Net [52]. Table 2 reports the training time of PicassoNet for a batch size of 16, and its testing time for different batch sizes. It can be seen that PicassoNet can process over a million facets per second on a commonly available single RTX 2080 Ti GPU. We summarize the statistics of the vertex and facet sizes of the room meshes we generate for S3DIS; the vertex and facet counts vary widely across rooms, with large standard deviations. Our fast mesh decimation method contributes largely to the efficiency of our network, e.g., the million facets processed per second. We apply our simplification algorithm to all room samples in S3DIS to decimate each of them into a mesh of 65536 vertices. We compare the speed of our algorithm to the quadric error metrics (QEM) method [17] that our method builds upon. The results are shown in Fig. 7 (left plot), in which the horizontal axis indicates the input vertex sizes of the room meshes while the vertical axis reports the runtime of the algorithms. Besides, we also test the runtime of our algorithm by decimating those room meshes into different resolutions, corresponding to vertex sizes of 65536, 32768, 16384, 8192, 4096, 2048, 1024, 512 and 256. These results are reported in Fig. 7 (right plot). We can tell from the figure that the runtime of our algorithm is essentially linear in the number of vertices of the input mesh.
These conclusions about the efficiency of our simplification algorithm are generic, as we consistently observe similar results on other mesh datasets as well. See the supplementary material for further results.
We provide a fast CUDA-accelerated mesh decimation technique to facilitate the 3D community in exploring hierarchical neural architectures on meshes. We propose 3 novel convolutions, namely vertex2facet, facet2facet, and facet2vertex, and provide CUDA implementations of their forward and backward passes. We also introduce the vertex2vertex convolution on top of the vertex2facet and facet2vertex convolutions. We use Gaussian mixtures and barycentric interpolation to incorporate fuzziness into the proposed convolutions. Our mesh simplification gathers the vertex clustering information into a 1D tensor, which is convenient for the CUDA-based pooling and unpooling operations. Most importantly, we introduce Picasso, a self-contained library for deep learning over 3D meshes and point clouds. We also present PicassoNet for semantic segmentation, and test its performance on S3DIS, where it provides results on par with the state of the art. The efficient version of PicassoNet can process 1 million facets per second during inference while retaining competitive accuracy. We also observe that more mesh convolutions in the deeper version increase the receptive field, and hence learn better features. In the future, we will maintain and upgrade the introduced library for efficiency and further functionalities.
Acknowledgements: This work is supported by ARC Discovery Grant DP190102443.
We summarize the major modules included in the Picasso library in Fig. 8. All the novel operations introduced in this paper are colorized, while the previous point cloud based operations [32, 33] are left blank. We also compare the performance of the included convolutions for a very large input size in Table 3. For the mesh-based convolutions, we use a batch of 16 meshes as input, each with 65536 vertices. For the point-based convolutions, we use only the vertices of those meshes as input. Currently, 3D deep networks widely take data samples of around 10,000 points/vertices as input. For spatial graph convolutions, the graph construction based on neighborhood search takes a significant amount of time, especially when the input size is large. For instance, see Table 10 of SPH3D-GCN [33
]. We introduce mesh-based modules to take advantage of the mesh's geodesic connections and save the graph construction time. It is worth noting that we estimate the runtime under TensorFlow 2.
| Convolution Type | batch size | input vertex/point size | kernel configuration | runtime (ms) |
|---|---|---|---|---|
| facet2facet | 16 | | 3, 6, 8 | 125 |
| vertex2facet | 16 | | 3, 6, 8 | 89 |
| facet2vertex | 16 | | 12, 48, 2 | 195 |
| hard SPH3D [33] | 16 | | 48, 2 | 118 + graph construction |
| fuzzy SPH3D [32] | 16 | | 48, 2 | 188 + graph construction |
The plane function of an arbitrary facet can be denoted as $n^{\top} p + d = 0$, where $p$ is a point in 3D space, $n$ is the facet normal with $\|n\| = 1$, and $d$ is the intercept. The quadric of each facet is defined as $Q = (A, b, c) = (n n^{\top},\, d\, n,\, d^2)$. It associates a value named the quadric error with an arbitrary point $p$ in space, computed as
$Q(p) = p^{\top} A\, p + 2\, b^{\top} p + c$.  (8)
The quadric of each vertex $v$ is an accumulation of the quadrics of its adjacent facets $\mathcal{N}(v)$, represented as
$Q_v = \sum_{f \in \mathcal{N}(v)} Q_f$.  (9)
The quadric of each vertex cluster $\mathcal{C}$ to be contracted is an accumulation of the quadrics of all the vertices in it, that is,
$Q_{\mathcal{C}} = (A_{\mathcal{C}},\, b_{\mathcal{C}},\, c_{\mathcal{C}}) = \sum_{v \in \mathcal{C}} Q_v$.  (10)
The optimal vertex placement of each cluster after contraction is ideally computed as
$p^{*} = -A_{\mathcal{C}}^{-1}\, b_{\mathcal{C}}$,  (11)
for which we determine whether $A_{\mathcal{C}}$ is a full-rank matrix by checking the reciprocal of its condition number. However, we still observe numerical instability in the decimated mesh. We therefore replace the computation of $p^{*}$ by
$\bar{p} = \operatorname{arg\,min}_{p \,\in\, \mathcal{C} \cup \{\mu_{\mathcal{C}}\}} Q_{\mathcal{C}}(p), \quad \mu_{\mathcal{C}} = \dfrac{1}{|\mathcal{C}|} \sum_{v \in \mathcal{C}} v$,  (12)
i.e., the candidate among the cluster vertices and their mean that minimizes the cluster quadric error.
Open3D [71] computes $\bar{p}$ in the same way. This improves the final results noticeably. See Table 4 for the performance comparison of PicassoNet on S3DIS Area 5 using Eq. (11) versus Eq. (12).
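To make the quadric pipeline concrete, here is a small NumPy sketch of the quadric construction, error evaluation, and optimal placement under the $(A, b, c)$ notation above. The function names are ours; the full-rank test via the reciprocal condition number follows the text, and the fallback placement is left to the caller:

```python
import numpy as np

def facet_quadric(n, d):
    """Quadric (A, b, c) of the plane n.p + d = 0, with ||n|| = 1."""
    return np.outer(n, n), d * n, d * d

def quadric_error(Q, p):
    """Q(p) = p^T A p + 2 b^T p + c: squared distance of p to the plane."""
    A, b, c = Q
    return p @ A @ p + 2.0 * (b @ p) + c

def optimal_placement(A, b, rcond_tol=1e-6):
    """p* = -A^{-1} b when A is well conditioned; otherwise None, and the
    caller falls back to a numerically stable candidate point."""
    s = np.linalg.svd(A, compute_uv=False)
    if s[-1] / s[0] > rcond_tol:
        return -np.linalg.solve(A, b)
    return None
```

A single facet yields a rank-1 quadric, so the solve is only well posed once quadrics from facets with independent normals have been accumulated, which is exactly why the rank check matters.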
In the main paper, we present the forward computations of the vertex2facet and facet2vertex convolutions in Eqs. (1), (2), (3) and Eqs. (4), (5), (6), respectively. In this section, we analyze their backward propagations in Eqs. (13) and (14), correspondingly. The computations of the facet2facet convolution are similar to those of the vertex2facet convolution; we hence omit them here to avoid redundancy. We also include the related forward computations in Eqs. (13) and (14) as references. Please refer to the main paper for the notations. Finally, as the GMM-based fuzzy coefficients are readily computed with built-in TensorFlow functions, there is no need to derive their backward propagations manually; TensorFlow computes them automatically.
vertex2facet kernel: $y = \frac{1}{M} \sum_{m=1}^{M} w_m \odot f_m$, $\quad \frac{\partial \ell}{\partial w_i} = \frac{1}{M} \sum_{m=1}^{M} b_{mi}\, f_m \odot \frac{\partial \ell}{\partial y}$, $\quad \frac{\partial \ell}{\partial f_i} = \frac{1}{M} \sum_{m=1}^{M} b_{mi}\, w_m \odot \frac{\partial \ell}{\partial y}$.  (13)
facet2vertex kernel: $y_v = \frac{1}{|\mathcal{N}(v)|} \sum_{k \in \mathcal{N}(v)} \sum_{t=1}^{T} c_t(n_k)\, w_t \odot g_k$, $\quad \frac{\partial \ell}{\partial w_t} = \frac{1}{|\mathcal{N}(v)|} \sum_{k \in \mathcal{N}(v)} c_t(n_k)\, g_k \odot \frac{\partial \ell}{\partial y_v}$, $\quad \frac{\partial \ell}{\partial g_k} = \frac{1}{|\mathcal{N}(v)|} \sum_{t=1}^{T} c_t(n_k)\, w_t \odot \frac{\partial \ell}{\partial y_v}$.  (14)
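The vertex2facet gradient in Eq. (13) can be sanity-checked numerically. The sketch below is our illustrative NumPy code, not the CUDA kernel: it implements the channel-wise forward pass and the analytic filter gradient, which can then be verified against finite differences.

```python
import numpy as np

def v2f_forward(vf, wf, bary):
    """Channel-wise vertex2facet forward pass: y = mean_m (B w)_m * (B f)_m.
    vf, wf: (3, C) vertex features and filters; bary: (M, 3) coordinates."""
    return ((bary @ wf) * (bary @ vf)).mean(axis=0)

def v2f_grad_filters(vf, bary, grad_y):
    """Analytic gradient w.r.t. the 3 filters (cf. Eq. 13):
    dL/dw_i = mean_m b_{mi} * f_m * dL/dy, returned with shape (3, C)."""
    f = bary @ vf                                            # (M, C)
    return (bary[:, :, None] * (f * grad_y)[:, None, :]).mean(axis=0)
```

Note that the filter gradient depends only on the interpolated features and barycentric weights, not on the filters themselves, because the forward pass is bilinear in the two operands.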
We report the 6-fold results of PicassoNet on the S3DIS dataset in Table 5. It can be noticed that PicassoNet, using only a few layers of mesh convolutions in each block, is already competitive with KPConv [59] and DCM-Net [52].
Originally, we trained PicassoNet using cropped meshes generated from the full-resolution meshes provided in the ScanNet dataset [12], which produced mediocre results. We found that this is caused by high-frequency signals in noisy areas as well [52]. We hence voxelize the full-resolution meshes in ScanNet [12] using a fixed grid size, and re-conducted the experiment. The performance of PicassoNet on the full and voxelized validation sets is reported in Table 6. Similar results are expected on the test set of ScanNet.
We summarize the statistics of the vertex and facet sizes of the full-resolution meshes in ScanNet [12]; the vertex and facet counts again vary widely, with large standard deviations. Similarly, we apply the proposed simplification algorithm to decimate all room samples in ScanNet into meshes of 65536 vertices. The runtime comparison of our algorithm and QEM is shown in the left plot of Fig. 9. We also test the runtime of our algorithm while decimating the room meshes into different resolutions, for which we use configurations identical to those for S3DIS. The results are shown in the right plot of Fig. 9. It can be noticed from the figure that our conclusions based on the S3DIS dataset consistently hold for ScanNet.
| Methods | Voxel | Point | SuperPoint | Input Features |
|---|---|---|---|---|
| SPG [30] | – | – | ✓ | [position, observation, geometrics] |
| SSP+SPG [29] | – | – | ✓ | [position, radiometry] |
| SEGCloud [58] | ✓ | – | – | occupancy |
| PointNet [43] | – | ✓ | – | |
| TangentConv [57] | – | ✓ | – | [distance to tangent plane, height, normal] |
| PointCNN [35] | – | ✓ | – | |
| GACNet [62] | – | ✓ | – | [height, eigenvalues] |
| SPH3D-GCN [33] | – | ✓ | – | |
| KPConv [59] | – | ✓ | – | |
| SegGCN [32] | – | ✓ | – | |
| DCM-Net [52] | – | ✓ | – | |
| PicassoNet (Proposed) | – | ✓ | – | |
[1a] Francis Engelmann, Theodora Kontogianni, Alexander Hermans, and Bastian Leibe. Exploring spatial context for 3D semantic segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 716–724, 2017.
[2a] Guohao Li, Matthias Müller, Ali Thabet, and Bernard Ghanem. DeepGCNs: Can GCNs go as deep as CNNs? In Proceedings of the IEEE International Conference on Computer Vision, 2019.