PCGCv1
Point Cloud Geometry Compression
This paper presents a novel end-to-end Learned Point Cloud Geometry Compression (a.k.a. Learned-PCGC) framework that efficiently compresses point cloud geometry (PCG) using deep neural network (DNN) based variational autoencoders (VAE). In our approach, the PCG is first voxelized, scaled, and partitioned into non-overlapping 3D cubes, which are then fed into stacked 3D convolutions to generate compact latent features and hyperpriors. Hyperpriors are used to improve the conditional probability modeling of the latent features. A weighted binary cross-entropy (WBCE) loss is applied in training, while adaptive thresholding is used in inference to remove unnecessary voxels and reduce distortion. Objectively, our method exceeds the geometry-based point cloud compression (G-PCC) algorithm standardized by the Moving Picture Experts Group (MPEG) by a significant margin, e.g., at least 60% BD-Rate (Bjontegaard Delta Rate) gains on common test datasets. Subjectively, our method presents better visual quality, with smoother surface reconstruction and appealing details, than all existing MPEG standard-compliant PCC methods. Our method requires about 2.5 MB of parameters in total, which is fairly small for practical implementation, even on embedded platforms. Additional ablation studies analyze a variety of aspects (e.g., cube size, kernels, etc.) to explore the application potential of our Learned-PCGC.
A point cloud is a collection of discrete points with 3D geometry positions and other attributes (e.g., color, opacity, etc.), which can efficiently represent volumetric visual data such as 3D scenes and objects [1]. Recently, with the explosive growth of point cloud enabled applications such as 3D free-viewpoint video and holoportation, high-efficiency Point Cloud Compression (PCC) technologies are highly desired.
Existing representative standard-compliant PCC methodologies were developed by the MPEG-I 3-Dimensional Graphics coding group (3DG) [2, 1], of which geometry-based PCC (G-PCC) for static point clouds and video-based PCC (V-PCC) for dynamic point clouds are two typical examples. Both G-PCC and V-PCC rely on conventional models, such as octree decomposition [3], triangulated surface models, the region-adaptive hierarchical transform [4, 5], and 3D-to-2D projection. Other explorations related to PCC are based on graphs [6], binary trees embedded with quadtrees [7], or, more recently, volumetric models [8].
In another avenue, a great number of deep learning-based image/video compression methods [9, 10, 11] have emerged recently. Most of them offer promising compression performance improvements over the traditional JPEG [12], JPEG 2000 [13], and even High-Efficiency Video Coding (HEVC) intra profile-based image compression [9, 11, 14]. These learned compression schemes leverage stacked DNNs to generate more compact latent feature representations for better compression [10], mainly for 2D images or video frames.

Motivated by the fact that redundancy in 2D images can be well exploited by stacked 2D convolutions (and relevant nonlinear activations), we explore the possibility of using 3D convolutions to exploit voxel correlation efficiently in 3D space. In other words, we aim to use proper 3D convolutions to represent the 3D point cloud compactly, mimicking the way that 2D image blocks can be well synthesized by stacked 2D convolutions [11, 14]. This paper focuses on static geometry compression, leaving other aspects (such as the compression of color attributes) for our future study.
A high-level overview of our Learned-PCGC is given in Fig. 1(a). It consists of 1) preprocessing for point cloud voxelization, scaling, and partition; 2) a variational autoencoder (VAE) based compression network; and 3) postprocessing for proper voxel classification, inverse scaling, and extraction (for display and storage). Note that voxelization and extraction are optional when the input point clouds are already in a 3D volumetric representation and need not be stored in another non-volumetric format.
PCG data is typically voxelized into a 3D volumetric representation. Each voxel uses a binary bit (1 or 0) to indicate whether the position $(i, j, k)$ is occupied by a positive and valid point (and its associated attributes). A voxel in 3D space is analogous to a pixel in a 2D image. A (down)scaling operation can be applied to downsize the input 3D volumetric model for better compression under a bit rate budget, especially in low bit rate scenarios. The corresponding (up)scaling is required at the other end for subsequent rendering.
Inspired by successful block-based image processing, we propose to partition the entire 3D volumetric model into non-overlapping cubes (each 3D cube is measured by its height, width, and depth, just as a 2D block is measured by its height and width), each of which contains $W \times W \times W$ voxels. The compression process runs cube-by-cube. In this work, operations are contained within the current cube without exploiting inter-cube correlation. This keeps complexity manageable for practical applications, by allowing parallel processing and affordable memory consumption on a per-cube basis.
For each individual cube, we use Voxception-ResNet (VRN) [15] to exemplify the 3D convolutional neural network (CNN) for compact latent feature extraction. Similar to [9, 10, 14], a VAE architecture is applied to leverage hyperpriors for better conditional context (probability) modeling when encoding the latent features. For end-to-end training, the weighted binary cross-entropy (WBCE) loss is introduced to measure the compression distortion for rate-distortion optimization, while an adaptive thresholding scheme is embedded for appropriate voxel classification in inference.

To ensure model generalization, our Learned-PCGC is trained using various shape models from ShapeNet [16], and is evaluated using the common test datasets suggested by the MPEG PCC group and the JPEG Pleno group. Extensive simulations reveal that our Learned-PCGC exceeds the existing MPEG-standardized G-PCC by a significant margin, e.g., 76% and 67% BD-Rate (Bjontegaard delta bitrate) gains using the D1 (point2point) distance, and 69% and 62% gains using the D2 (point2plane) distance, against G-PCC using the octree and trisoup models respectively. Our method also achieves performance comparable to the other standardized codec, V-PCC. In addition to these objective measures, we also report that our method offers much better visual quality (e.g., smoother surfaces and appealing details) when rendered using the surface model. This is mainly due to the inherently 3D structural representation using learned 3D transforms. A fairly small network model (e.g., 2.5 MB) is used, and parallel cube processing is also feasible. This keeps complexity low, which is friendly to hardware or embedded implementation.
The contributions of this paper are highlighted as follows:
We have explored a novel direction that applies a learning-based framework to represent point cloud geometry using compact features with state-of-the-art compression efficiency. The framework consists of a preprocessing module for point cloud voxelization, scaling, and partition; a compression network for rate-distortion optimized representation; and a postprocessing module for point cloud reconstruction and rendering;
We have demonstrated, objectively and subjectively, the efficiency of applying stacked 3D convolutions (e.g., VRN) in a VAE structure to represent the sparse voxels in 3D space;
Instead of directly using the D1 or D2 distortion for optimization, a WBCE loss in training and adaptive thresholding in inference are applied to determine whether the current voxel is occupied, treating it as a classical classification problem. This approach works well for exploiting voxel sparsity;
Additional ablation studies analyze a variety of aspects of our method (e.g., partition size, kernel size, thresholding, etc.) to understand its capability for practical applications.
The rest of this paper is structured as follows: Section II reviews relevant studies on the compression of point clouds, learning-based image/video coding algorithms, and recently emerged studies of autoencoders for point cloud processing; our Learned-PCGC is given in Section III with a systematic sketch and detailed discussions, followed by experimental explorations and ablation studies that demonstrate the efficiency of our method; and concluding remarks are drawn in Section VI.
Research relevant to this work can be classified into point cloud geometry compression, learned image compression, and recently emerging autoencoder-based point cloud processing.
Prior PCGC approaches mainly relied on conventional models, including octree, trisoup, and 3D-to-2D projection based methodologies.
Octree Model. A very straightforward way to describe point cloud geometry is to use the octree-based representation [17] recursively. Binary labels (1 or 0) can be given to each node to indicate whether the corresponding voxel or 3D cube is occupied. Such a binary string can then be compressed using statistical methods, with or without prediction [18, 19]. The octree-based approach has been adopted into the popular Point Cloud Library (PCL) [20], and is extensively referred to as a benchmark solution [21]. The MPEG standard-compliant G-PCC [22] also applies the octree coding mechanism, which is known as the octree geometry codec. Most octree-based algorithms show decent efficiency for sparse point cloud compression, but limited performance for dense point cloud compression.
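As an illustration of the recursive occupancy labeling described above, the following is a minimal sketch in Python. It is not the PCL or G-PCC implementation: the function name `octree_occupancy_codes`, the breadth-first traversal, and the bit ordering of the eight children are all assumptions made for this example.

```python
def octree_occupancy_codes(voxels, depth):
    """Breadth-first 8-bit occupancy codes for voxels in a (2**depth)^3 grid.

    Each visited node emits one byte; bit n is set when child n contains at
    least one occupied voxel. The child numbering (x bit, y bit, z bit) is an
    arbitrary convention chosen for this sketch.
    """
    codes = []
    level = [((0, 0, 0), set(voxels))]  # (node origin, voxels inside the node)
    for d in range(depth):
        half = 2 ** (depth - d - 1)  # edge length of a child node at this level
        next_level = []
        for (ox, oy, oz), pts in level:
            children = {}
            for (i, j, k) in pts:
                idx = (((i - ox) // half) * 4
                       + ((j - oy) // half) * 2
                       + ((k - oz) // half))
                children.setdefault(idx, set()).add((i, j, k))
            code = 0
            for idx in children:
                code |= 1 << idx
            codes.append(code)
            # Only occupied children are subdivided further.
            for idx in sorted(children):
                origin = (ox + (idx >> 2) * half,
                          oy + ((idx >> 1) & 1) * half,
                          oz + (idx & 1) * half)
                next_level.append((origin, children[idx]))
        level = next_level
    return codes
```

The resulting byte sequence is the "binary string" that a statistical coder would then compress; empty children cost nothing beyond their zero bit, which is why the representation suits sparse data.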
Mesh/Surface Model. A mesh/surface model can be regarded as the combination of a point cloud and a fixed vertex-face topology. Thus, an alternative approach is to use a surface model for point cloud compression, as investigated in [23, 24]. In these studies, 3D object surfaces are represented as a series of triangle meshes, whose vertices are encoded for delivery and storage. The decoded point cloud is obtained by sampling the reconstructed meshes. MPEG G-PCC also includes such a triangulation-based mesh model, a.k.a. the triangle soup representation of geometry [22], in its test model. This is known as the trisoup geometry codec. The trisoup model is preferred for dense point clouds.
Projection-Based Approach. Other attempts project the 3D object onto multiple 2D planes from a variety of viewpoints. This approach can leverage existing, successful image and video codecs. The key issue for this solution is how to perform the 3D-to-2D projections efficiently. As exemplified in MPEG V-PCC [25], a point cloud is decomposed into a set of patches that are packed into a regular 2D image grid with minimum unused space. Padding is often executed to fill the empty space and obtain a piecewise smooth image. With such a projection, point cloud geometry can be converted into 2D depth images that can be compressed using HEVC [26]. So far, V-PCC has exhibited state-of-the-art coding efficiency for geometry compression compared with G-PCC, PCL, etc.

Recent studies [27, 11, 10, 9, 28] have shown that learned image compression offers better rate-distortion performance than the traditional JPEG [12], JPEG 2000 [13], and even HEVC-based Better Portable Graphics (BPG, https://bellard.org/bpg/) [11, 14]. These algorithms are mainly based on the VAE structure with stacked 2D CNNs for compact latent feature extraction. Hyperpriors are used to improve the conditional probability estimation of latent features. While end-to-end learning schemes have been deeply studied for 2D image compression and even extended to video [28], systematic efforts to study effective and efficient neural operations for 3D point cloud compression are lacking. One reason is that pixels in the 2D grid are well structured and can be predicted via (masked) convolutions, whereas voxels in a 3D cube are sparser and exhibit unstructured local and global correlations, which makes them difficult to compress.

Existing point cloud representation and generation models using autoencoders serve as good references for point cloud compression. For example, Achlioptas et al. [29] proposed an end-to-end deep autoencoder that directly accepts point clouds for classification and shape completion. Brock et al. [15] introduced a voxel-based VAE architecture using stacked 3D CNNs for 3D object classification. Dai et al. [30] applied 3D-Encoder-Predictor CNNs for shape completion, and Tatarchenko et al. [31] reported a deep CNN autoencoder for an efficient octree representation. These works were mainly developed for machine vision tasks rather than compression, but their autoencoder architectures provide references for representing 3D point clouds efficiently. Inspired by these studies, we try to design appropriate transforms using autoencoders for compact representation.
Quach et al. [32] recently proposed a convolutional transform based PCG compression method, which is the work most relevant to ours. When both are compared with PCL, our work offers larger gains. This is mainly because the method in [32] produces fairly redundant features from a shallow network structure with large convolutions, models the context of latent features inaccurately, etc. We show more details in the subsequent ablation studies.
This section details each component of our Learned-PCGC, shown in Fig. 1(b), which consists of preprocessing, an end-to-end learning-based compression network, and postprocessing.
Voxelization. Point clouds may or may not be stored in a 3D volumetric representation. Thus, an optional step is to convert the raw format to a 3D representation, typically using an $(i, j, k)$-based Cartesian coordinate system. This is referred to as voxelization. Given that our current focus is the geometry of the point cloud, a voxel at $(i, j, k)$ is set to 1, e.g., $V(i, j, k) = 1$, if it has positive attributes, and $V(i, j, k) = 0$ otherwise. The point cloud precision sets the maximum achievable value in each dimension. For instance, 10-bit precision allows $0 \le i, j, k \le 2^{10} - 1$. PCG refers to its volumetric representation throughout this paper unless pointed out specifically. Such a volumetric model of the PCG captures inter-voxel correlations in 3D space, which makes it well suited to the subsequent 3D convolutions that derive an efficient and compact representation.
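A minimal sketch of such a voxelization step is given below in Python. The paper does not spell out how raw coordinates are normalized, so the min-shift and the single uniform scale used here are assumptions, as is the function name `voxelize`; only the occupancy idea (a voxel is 1 when at least one point falls in it) comes from the text.

```python
def voxelize(points, precision_bits):
    """Map raw (x, y, z) floats onto an integer grid with 2**precision_bits
    positions per axis and return the set of occupied voxels (i, j, k).

    Voxels absent from the returned set are implicitly 0 (empty).
    """
    size = 2 ** precision_bits
    mins = [min(p[d] for p in points) for d in range(3)]
    maxs = [max(p[d] for p in points) for d in range(3)]
    # Assumption: one common scale for all axes, so the shape is not distorted.
    extent = max(maxs[d] - mins[d] for d in range(3)) or 1.0
    step = extent / max(size - 1, 1)
    occupied = set()
    for p in points:
        voxel = tuple(int(round((p[d] - mins[d]) / step)) for d in range(3))
        occupied.add(voxel)  # points landing in the same voxel merge
    return occupied
```

For example, with `precision_bits=10` every coordinate ends up in the range 0 to 1023, matching the 10-bit case mentioned above.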
Scaling. Image downscaling has been used in image/video compression [33] to preserve image/video quality under a constrained bit rate, especially at low bit rates. This idea can be directly extended to point clouds for better rate-distortion efficiency in the low bit rate range. In addition, scaling can be used to reduce sparsity for better compression by zooming out the point cloud: the distance between sparse points gets smaller, and the point density within a fixed-size cube increases. As revealed in later experiments, applying a scaling factor in preprocessing leads to noticeable compression efficiency gains for sparse point cloud geometry, such as Class C, and yields well-preserved performance at low bit rates for the fairly dense Classes A and B, shown in Fig. 5 and Table I.
In this work, we propose a simple yet effective operation via direct downscaling and rounding in advance. Let $X = \{(i_n, j_n, k_n)\}$, $n = 1, \ldots, N$, be the set of points of the input point cloud. We scale this point cloud by multiplying it with a scaling factor $s$, $s \le 1$, and round the result to the closest integer coordinates, i.e.,
$$\hat{X} = \mathrm{ROUND}(X \times s) \tag{1}$$
Duplicate points at the same coordinate after rounding are simply removed in this study. An interesting topic is to explore adaptive scaling within the learning network. However, it requires substantial effort and is deferred to our future study. Meanwhile, applying the simple scaling operation in preprocessing is already an effective scheme, as shown in later experimental studies.
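The scale, round, and deduplicate steps described above can be sketched as follows in Python. The helper names and the round-half-up convention are assumptions made for this example; the text only states that coordinates are rounded to the closest integer and that duplicates are removed.

```python
import math


def round_half_up(value):
    """Round to the nearest integer, with ties going up (assumed convention)."""
    return int(math.floor(value + 0.5))


def scale_points(points, s):
    """Multiply integer voxel coordinates by s, round, and drop duplicates."""
    scaled = set()
    for (i, j, k) in points:
        scaled.add((round_half_up(i * s),
                    round_half_up(j * s),
                    round_half_up(k * s)))
    return sorted(scaled)
```

With `s = 0.5`, the four points `(0,0,0)`, `(1,1,1)`, `(2,2,2)`, `(3,3,3)` collapse to three, which illustrates how downscaling raises the point density inside a fixed-size cube at the cost of merging neighbors.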
Partition. A point cloud geometry typically presents a large volume of data, especially at high precision. It is difficult and costly to process an entire point cloud at once. Thus, motivated by the successful block-based processing pipeline adopted in popular image/video standards, we partition the entire point cloud into non-overlapping cubes, as shown in Fig. 2. Each cube has a size of $W \times W \times W$.
The geometry position of each cube can be signaled implicitly by following the raster scanning order from the very first cube to the last one, regardless of whether a cube is completely empty. Alternatively, we can specify the position of each cube explicitly using the existing octree decomposition method, leveraging the sparsity of the point cloud. Each valid cube (i.e., one with at least one occupied voxel) can be seen as a super-voxel of size $W \times W \times W$. Thus, the number of super-voxels is small compared with the number of voxels in the same volumetric point cloud. As revealed in later ablation studies, signaling the cube position explicitly using the existing octree compression method [20] requires only a very small overhead (e.g., <1%). In addition, the number of occupied voxels in each cube is transmitted for later classification-based point reconstruction. In summary, we treat the geometry position and the number of occupied voxels of each individual (and valid) cube as metadata that is explicitly encapsulated in the compressed binary strings.
In the current study, each cube is processed independently without exploring inter-cube correlations. Massive parallelism can be achieved by processing cubes in parallel. Assuming the geometry position of a specific cube is $(i_c, j_c, k_c)$, the global coordinates $(i_g, j_g, k_g)$ of a voxel can be easily converted to its local cubic coordinates $(i_l, j_l, k_l)$,
$$(i_l, j_l, k_l) = (i_g, j_g, k_g) - (i_c \times W, j_c \times W, k_c \times W) \tag{2}$$
for the following learning-based compression.
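The partition and the global-to-local coordinate conversion can be sketched as below in Python. This is an illustrative sketch, not the authors' code: it assumes that the cube position is an integer cube index (so the cube origin is the index times the cube edge `w`), and the function names `partition` and `to_global` are hypothetical.

```python
from collections import defaultdict


def partition(voxels, w):
    """Group occupied voxels into non-overlapping w*w*w cubes.

    Returns {cube_index: sorted local coordinates}. Empty cubes never appear,
    so only valid cubes need to be signaled and coded.
    """
    cubes = defaultdict(list)
    for (i, j, k) in voxels:
        cube = (i // w, j // w, k // w)
        # Local coordinate = global coordinate minus the cube origin.
        local = (i - cube[0] * w, j - cube[1] * w, k - cube[2] * w)
        cubes[cube].append(local)
    return {cube: sorted(local) for cube, local in cubes.items()}


def to_global(cube, local, w):
    """Inverse mapping used after decoding a cube."""
    return tuple(c * w + l for c, l in zip(cube, local))
```

Each dictionary entry is self-contained, so the cubes can be handed to separate workers; the length of each entry's list is the per-cube occupied-voxel count that the scheme carries as metadata.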
We aim to find a more compact representation of any input cube with sparsely distributed voxels. This mainly involves the pursuit of appropriate transforms via stacked 3D CNNs, accurate rate estimation, and a novel classification-based distortion loss for end-to-end optimization.
3D Convolution-based Transforms. Transforms have been used for decades to represent 2D image and video data in a more compact format, from the discrete cosine transform to recently emerged learned convolution-based approaches. In particular, learned 2D transforms have demonstrated promising coding performance in image compression [14, 10] via stacked CNN-based autoencoders, by efficiently exploiting local and global spatial correlations.
Thus, a natural extension is to design proper transforms based on stacked 3D CNNs to represent the 3D point cloud. In the encoding process, the forward transform analyzes and exploits the spatial correlation, so it can be referred to as the “Analysis Transform”. Ideally, for any $W \times W \times W$ cube $x$, we aim to derive compact latent features $y$, which are represented using a 4D tensor with the size of (channels, length, width, height). The analysis transform can be formulated as:
$$y = f_e(x; \theta_e) \tag{3}$$
with $\theta_e$ for convolutional weights.
Correspondingly, a mirrored synthesis transform is devised to decode the quantized latent features $\hat{y}$ into a reconstructed voxel cube $\tilde{x}$, which can be formulated as:
$$\tilde{x} = f_d(\hat{y}; \phi_d) \tag{4}$$
with $\phi_d$ as its parameters.
Analysis and synthesis transforms are used in both the main and hyper encoder-decoder pairs, shown in Fig. 1(b).
In this work, we use the Voxception-ResNet (VRN) structure proposed in [15] as the basic 3D convolutional unit in the main codec, for its superior efficiency inherited from both the residual network [34] and the inception network [35]. The architecture of VRN is illustrated in Fig. 3. In the main codec, nine stacked VRNs are used for both the analysis and synthesis transforms.
Given that hyperpriors are mainly used for latent feature entropy modeling, we apply three consecutive lightweight 3D convolutions (with a further downsampling mechanism embedded) instead of VRNs in the hyper codec. Decoded hyperpriors are then used to improve the conditional probability of the latent features from the main codec. Details regarding the entropy rate modeling are given in Section III-D.
In this work, we apply relatively small kernels for convolutions, e.g., or , which are then integrated with the VRN model to efficiently capture the essential information for a compact representation. Meanwhile, smaller convolution kernels are also implementation-friendly, with lower complexity.
A simple yet effective rounding operation is used for feature quantization in inference, i.e.,
$$\hat{y} = \mathrm{ROUND}(y) \tag{5}$$
where $y$ and $\hat{y}$ represent the original and quantized representations respectively.
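The sketch below contrasts the inference-time rounding with the additive uniform noise surrogate that this section uses during training, written in plain Python on a flat list of feature values. The function names are hypothetical, and Python's built-in `round` (ties to even) stands in for the rounding operator as an assumption.

```python
import random


def quantize(y):
    """Inference: round every latent element to the nearest integer."""
    return [round(v) for v in y]


def add_uniform_noise(y, rng):
    """Training surrogate: add noise drawn uniformly from [-0.5, 0.5].

    The perturbed value stays within half a step of the original, mimicking
    the rounding error while keeping the mapping differentiable in y.
    """
    return [v + rng.uniform(-0.5, 0.5) for v in y]
```

In a real training loop the noise would be drawn inside the autodiff framework on tensors; the list version here only shows the arithmetic.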
However, direct rounding is not differentiable for backpropagation in the end-to-end training scheme. Instead, we approximate the rounding process by adding uniform noise to ensure differentiability,
$$\tilde{y} = y + \mu \tag{6}$$
where $\mu$ is random uniform noise ranging from $-\frac{1}{2}$ to $\frac{1}{2}$, and $\tilde{y}$ represents the “noisy” latent representation carrying the actual rounding error. $\tilde{y}$ follows a uniform distribution $\mathcal{U}$ centered on $y$: $p_{\tilde{y}|y}(\tilde{y}|y) = \mathcal{U}(y - \frac{1}{2}, y + \frac{1}{2})$. Such an approximation using added noise is also used in [10].

Entropy coding is critical for source compression to exploit statistical redundancy. Among existing approaches, arithmetic coding is widely used and adopted in standards and products because of its superior performance. Thus we choose arithmetic coding to compress each element of the quantized latent features. Theoretically, the entropy bound of a source symbol (e.g., a feature element) is closely related to its probability distribution, and, more importantly, accurate rate estimation plays a key role in lossy compression for rate-distortion optimization [36].

We can approximate the actual bit rate of the quantized latent feature $\hat{y}$ via
$$R_{\hat{y}} = \mathbb{E}_{\hat{y}}\left[-\log_2 p_{\hat{y}}(\hat{y})\right] \tag{7}$$
with $p_{\hat{y}}$ as the self probability density function (p.d.f.) of $\hat{y}$. Rate modeling can be further improved from (7) if more priors are available. Thus, in existing learned image compression algorithms [10, 14], a VAE structure is enforced to have both main and hyper codecs. In the hyper codec, the dimensions of the latent features are further downscaled to provide hyperpriors $\hat{z}$ without noticeable overhead. These hyperpriors are decoded as prior knowledge for better probability approximation of the latent feature $\hat{y}$ when conditioned on the distribution of $\hat{z}$.

Note that the same quantization process is applied to both latent features and hyperpriors. Following the aforementioned discussion, we can model the decoded hyperpriors $\hat{z}$ (e.g., with assumed uniform rounding noise) using a fully factorized model, i.e.,
$$p_{\hat{z}|\psi}(\hat{z}|\psi) = \prod_i \left(p_{\hat{z}_i|\psi^{(i)}}(\psi^{(i)}) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\hat{z}_i) \tag{8}$$
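where $\psi^{(i)}$ represents the parameters of each univariate distribution $p_{\hat{z}_i|\psi^{(i)}}$. A Laplacian distribution $\mathcal{L}$ is then used to approximate the p.d.f. of $\hat{y}$ when conditioned on the hyperpriors $\hat{z}$, i.e.,
$$p_{\hat{y}|\hat{z}}(\hat{y}|\hat{z}) = \prod_i \left(\mathcal{L}(\mu_i, \sigma_i) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\hat{y}_i) \tag{9}$$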
The mean and variance parameters $(\mu_i, \sigma_i)$ of each element $\hat{y}_i$ are estimated from the decoded hyperpriors.

Rate-distortion optimization is adopted in popular image and video compression algorithms to trade off the distortion ($D$) and the bit rate ($R$). In our end-to-end learning framework, we follow this convention and define the Lagrangian loss for training, so as to maximize the overall rate-distortion performance, i.e.,
$$J_{loss} = R + \lambda D \tag{10}$$
where $\lambda$ controls the trade-off for each individual bit rate.
Rate Estimation: In our VAE-based compression framework, the total rate consumption comes from the latent features $\hat{y}$ and the hyperpriors $\hat{z}$. Referring to (8) and (9), the rate approximation can be written as
$$R_{\hat{y}} = \sum_i -\log_2\left(p_{\hat{y}_i|\hat{z}_i}(\hat{y}_i|\hat{z}_i)\right) \tag{11}$$
$$R_{\hat{z}} = \sum_i -\log_2\left(p_{\hat{z}_i|\psi^{(i)}}(\hat{z}_i|\psi^{(i)})\right) \tag{12}$$
The total rate can be easily derived via summation, i.e., $R = R_{\hat{y}} + R_{\hat{z}}$. Here, the rate spent on hyperpriors can be regarded as side information or overhead, and it occupies far fewer bits than the latent representations in our design. Note that we only use hyperpriors for rate estimation, without including any autoregressive spatial neighbors [14, 9]. This is driven by the fact that voxels are distributed sparsely; thus, neighbors may not bring much gain in context modeling, but may break voxel parallelism and incur large complexity.
Distortion Measurement: Existing image/video compression approaches use MSE or SSIM as the distortion measure. In this work, we instead propose a novel classification-based mechanism to measure the distortion. Such a classification method fits the natural principle of extracting valid point cloud data after decoding. More specifically, decoded voxels in each cube usually fall in a predefined range, e.g., from 0 to 1 in this work, or from 0 to 255 if 8-bit integer processing is enforced. Recall that each valid voxel in a point cloud geometry indicates that its position is occupied. A simple binary flag, “1” or “TRUE”, often refers to an occupied voxel, while “0” or “FALSE” refers to a null or empty voxel. Therefore, each decoded voxel needs to be classified as either 1 or 0 accordingly.
Towards this goal, we use a weighted binary cross-entropy (WBCE) measurement as the distortion in training, i.e.,
$$D_{WBCE} = \frac{1}{N_o}\sum^{N_o} -\log p_{\tilde{x}_o} + \alpha \frac{1}{N_n}\sum^{N_n} -\log\left(1 - p_{\tilde{x}_n}\right) \tag{13}$$
where $p_{\tilde{x}} = \mathrm{sigmoid}(\tilde{x})$ is used in this work to enforce $p_{\tilde{x}} \in (0, 1)$ as the probability of being occupied, $\tilde{x}_o$ represents occupied voxels, $\tilde{x}_n$ represents null voxels, and $N_o$, $N_n$ represent the numbers of occupied and null voxels, respectively. Note that we do not classify a voxel into a fixed 1 or 0, but let $p_{\tilde{x}} \in (0, 1)$ to guarantee differentiability in the backpropagation used in training. Different from the standard BCE loss, which weights positives and negatives equally, we calculate the mean loss of positive and negative samples separately, with a hyperparameter $\alpha$ to reflect their relative importance and balance the loss penalty. We set $\alpha$ to 3 according to our experiments.

Classification. In the inference stage, each decoded voxel in a point cloud cube is presented as a floating-point number in (0, 1), or an 8-bit integer in (0, 255), depending on the specific implementation. Thus, we first need to classify it into binary 1 or 0. A fixed threshold can be easily applied, for example, a median value of 0.5; however, performance often suffers, as shown in Fig. 14. Instead, we propose an adaptive thresholding scheme for voxel classification, according to the number of occupied points in the original point cloud cube. This information is embedded for each cube as metadata. Since $p_{\tilde{x}}$ can also be regarded as the probability of being occupied, we sort $p_{\tilde{x}}$ to extract the top-$k$ voxels, which are most likely to be occupied, where $k$ is the number of occupied voxels in the original cube. Top-$k$ selection fits the distortion criterion used in (13) for end-to-end training, i.e., minimizing the WBCE by keeping the processed voxel distribution (occupied or null) as close as possible to the original input distribution.
Detailed discussion is given in subsequent ablation studies.
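The training-time WBCE distortion and the inference-time adaptive thresholding can be sketched together in Python as below. This is a sketch under assumptions rather than the authors' implementation: the function names are hypothetical, inputs are flat lists of per-voxel occupancy probabilities, the weight `alpha` is assumed to multiply the null-voxel term, and the `eps` clamp is added for numerical safety.

```python
import math


def wbce(probs, occupancy, alpha=3.0, eps=1e-7):
    """Weighted binary cross-entropy: the occupied and null voxels are averaged
    separately, then combined as mean(occupied) + alpha * mean(null)."""
    pos = [-math.log(max(p, eps)) for p, o in zip(probs, occupancy) if o == 1]
    neg = [-math.log(max(1.0 - p, eps)) for p, o in zip(probs, occupancy) if o == 0]
    loss = 0.0
    if pos:
        loss += sum(pos) / len(pos)
    if neg:
        loss += alpha * sum(neg) / len(neg)
    return loss


def classify_top_k(probs, k):
    """Adaptive thresholding: mark the k most probable voxels as occupied,
    where k is the occupied-voxel count carried as per-cube metadata."""
    order = sorted(range(len(probs)), key=lambda n: probs[n], reverse=True)
    keep = set(order[:k])
    return [1 if n in keep else 0 for n in range(len(probs))]
```

Because `k` comes from the original cube, the decoder always reproduces the original number of points regardless of how confident the network is, which a fixed threshold cannot guarantee.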
Inverse Scaling. A mirrored inverse scaling with a factor of $1/s$ (the reciprocal of the preprocessing scaling factor $s$) is applied in postprocessing, after the inference of all cubes is complete, for rendering and display. This work applies a very simple linear scaling strategy. A more complex scheme, such as content-adaptive scaling, could better retain the reconstructed quality. This is an interesting topic for our future study.
Extraction. Extraction is an optional step in postprocessing, just as voxelization is in preprocessing. It converts the 3D point cloud into another file format for storage or exchange, such as ASCII or the polygon file format (ply) used by the MPEG PCC group. In scenarios where the original point clouds are already in a 3D volumetric representation, or where decoded point clouds are displayed directly, extraction is not required.
Following the above discussion, our Learned-PCGC runs iteratively over the cubes. For each cube, it encapsulates 1) the cube position cube_pos, 2) the number of originally occupied voxels num_occupied_voxel, and 3) the entropy-coded features and hyperpriors into the binary bitstream for delivery and exchange. Here, we refer to parts 1) and 2) as the metadata (or payload overhead), and to part 3) as the main payload.
Metadata. For cube_pos, we simply use the octree model in [37] to indicate the location of the current cube in a volumetric point cloud. Since num_occupied_voxel is used for classification, we embed it directly. In the worst case, we need $3 \times \log_2 W$ bits for a cube of size $W \times W \times W$. In practice, num_occupied_voxel may be much less than $W^3$ because of the sparse nature of the data. Alternatively, we can signal another syntax element, such as max_num_occupied_voxel for the entire point cloud, to bound the number of voxels in each cube.
Payload. As seen, both features and hyperpriors are encoded in the Learned-PCGC. Syntax elements for hyperpriors and latent features (in the corresponding fMaps) are encoded consecutively using an arithmetic coder. The context probability of the hyperpriors is based on a fully factorized distribution, while the context probability of the latent features is conditioned on the hyperpriors.
Datasets. We randomly select 3D mesh models from the core dataset of ShapeNet [16] for training, including 55 categories of common objects. We sample the mesh model into point clouds by randomly generating points on the surfaces of the mesh. To ensure the uniform distribution of the points, we set the point density as when sampling each mesh surface. Fig. 4 shows some examples of these point clouds used for training from ShapeNet. These point clouds are then voxelized on a occupancy space. We randomly collect cubes from each voxelized point cloud, resulting in cubes in total used in this work.
Strategy. The loss function used for training is defined in Eq. (10). We set the rate-distortion trade-off $\lambda$ from 0.75 to 16 to derive various models with different compression performance. During training, we first train the model at a high bit rate by setting $\lambda$ to 16 and then use it to initialize the models for lower bit rates. Applying the pretrained model from higher bit rates for transfer learning not only ensures faster convergence but also yields reliable and stable outcomes. The learning rate is set to , and the batch size is set to 8. Training iteration executes more than batches for model derivation. Here, we use Adam [38] to optimize the proposed network, with its parameters $\beta_1$ and $\beta_2$ set to 0.9 and 0.999, respectively.

We apply the trained models in testing to validate the efficiency of our proposed method in the subsequent discussions.
Testing Datasets. We choose three different test sets adopted by the MPEG PCC [39] and JPEG Pleno [40] groups to evaluate the performance of the proposed method, as shown in Fig. 5 and Table I. These testing datasets present different structures and properties. Specifically, Class A (full bodies) exhibits smooth surfaces and complete shapes, while Class B (upper bodies) presents noisy and incomplete surfaces (even with visible holes and missing parts). The three inanimate objects in Class C have higher geometry precision but a sparser voxel distribution. The frames in Table I used for evaluation are those suggested by the MPEG PCC group.
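Table I

| Class | Point Cloud | Points# | Precision | Frame# |
|---|---|---|---|---|
| A | Loot | 805285 | 10 | 1200 |
| A | Redandblack | 757691 | 10 | 1550 |
| A | Soldier | 1089091 | 10 | 690 |
| A | Longdress | 857966 | 10 | 1300 |
| B | Andrew | 279664 | 9 | 1 |
| B | David | 330791 | 9 | 1 |
| B | Phil | 370798 | 9 | 1 |
| B | Sarah | 302437 | 9 | 1 |
| C | Egyptian Mask | 272684 | 12 | - |
| C | Statue Klimt | 499660 | 12 | - |
| C | Shiva | 1009132 | 12 | - |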
Objective Comparison. We mainly compare our method with other PCGC algorithms, including 1) the octree-based codec in the Point Cloud Library (PCL) [20]; 2) the MPEG PCC test model (TM) TM13 for category 1 (static point cloud data), a.k.a. G-PCC; and 3) the MPEG PCC TM2 for category 2 (dynamic content), a.k.a. V-PCC. The geometry model in G-PCC can be either the octree or the trisoup representation. The former uses an octree coding mechanism similar to the implementation in PCL, and the latter is based on the triangle soup representation of the geometry. They are denoted as G-PCC (octree) and G-PCC (trisoup), respectively.
For a fair comparison, we enforce similar bit rate ranges for PCL, G-PCC (octree), G-PCC (trisoup), and our method. The bit rate range follows the MPEG PCC Common Test Condition (CTC) [39].
For PCL, we use the OctreePointCloudCompression approach in PCLv1.8.0 [20] for geometry compression only. We set octree resolution parameters from 1 to 64 to obtain serial rate points.
For GPCC, the latest TM13v6.0 [37] is used with parameter settings following the CTC [39]. For GPCC (octree), we set positionQuantizationScale from 0.75 to 0.015, leaving other parameters as default. For GPCC (trisoup), we set tirsoup_node_size_log2 to 2, 3, 4, and positionQuantizationScale to 1 for Class A and B, and 0.125 or 0.25 for Class C^{3}^{3}3Downscaling is applied for Class C point cloud because they are typically sparse but with higher precision. .
Our LearnedPCGC is trained in an end-to-end fashion for individual bit rates by adapting and scaling factor .
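Table II: BD-Rate (%) of our method against PCL, GPCC (octree) and GPCC (trisoup), for D1 (p2point) and D2 (p2plane) distortion. Negative values are bit rate savings.

| Class | Point Cloud | D1: PCL | D1: GPCC (octree) | D1: GPCC (trisoup) | D2: PCL | D2: GPCC (octree) | D2: GPCC (trisoup) |
|---|---|---|---|---|---|---|---|
| A | Loot | -91.50 | -80.30 | -68.58 | -87.50 | -73.49 | -68.91 |
| A | Redandblack | -90.48 | -79.47 | -68.10 | -86.70 | -73.33 | -68.22 |
| A | Soldier | -90.93 | -79.67 | -62.14 | -87.07 | -73.08 | -67.39 |
| A | Longdress | -91.22 | -80.46 | -62.97 | -87.34 | -74.09 | -68.35 |
| A | Average | -91.03 | -79.98 | -65.44 | -87.15 | -73.49 | -68.21 |
| B | Andrew | -88.64 | -77.57 | -74.63 | -81.61 | -66.79 | -65.23 |
| B | David | -87.56 | -75.25 | -72.23 | -82.52 | -68.13 | -66.95 |
| B | Phil | -88.31 | -77.72 | -75.42 | -82.02 | -68.74 | -66.33 |
| B | Sarah | -88.62 | -76.91 | -79.42 | -83.36 | -69.51 | -72.61 |
| B | Average | -88.28 | -76.86 | -75.42 | -83.30 | -68.29 | -67.78 |
| C | Egyptian Mask | -84.31 | -73.53 | -50.14 | -85.12 | -74.02 | -40.80 |
| C | Statue Klimt | -83.45 | -75.89 | -60.33 | -74.66 | -62.33 | -47.56 |
| C | Shiva | -77.30 | -68.92 | -64.91 | -67.89 | -56.42 | -51.85 |
| C | Average | -81.68 | -72.78 | -58.46 | -75.89 | -64.25 | -46.73 |
| | Overall average | -87.48 | -76.88 | -67.17 | -82.34 | -69.08 | -62.20 |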
Objective comparison is evaluated using the BDRate, as shown in Table II. Two distortion metrics are widely used for point cloud geometry compression: the mean-squared error (MSE) with point-to-point distance (D1, p2point), and the MSE with point-to-plane distance (D2, p2plane) [41, 42]. Bit rate is measured in bits per input point (bpp), or bits per occupied voxel (bpov).
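The two distortion measures and the bit rate unit can be sketched as follows. This is a simplified, brute-force NumPy illustration, not the MPEG evaluation tool: the function names are hypothetical, the normals are assumed to be unit vectors attached to the reference points, and only one direction (from `src` to `dst`) is computed. A symmetric score can be formed, for instance, by evaluating both directions and keeping the larger error.

```python
import numpy as np

def nearest_neighbors(src, dst):
    """Index of the nearest point in dst for every point in src (brute force)."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def d1_mse(src, dst):
    """One-directional point-to-point MSE from src to dst."""
    err = src - dst[nearest_neighbors(src, dst)]
    return float((err ** 2).sum(axis=1).mean())

def d2_mse(src, dst, dst_normals):
    """One-directional point-to-plane MSE: the error vector is projected
    onto the (unit) normal of the nearest point in dst."""
    idx = nearest_neighbors(src, dst)
    err = src - dst[idx]
    proj = (err * dst_normals[idx]).sum(axis=1)
    return float((proj ** 2).mean())

def bits_per_point(num_bits, num_input_points):
    """Bit rate in bits per input point (bpp)."""
    return num_bits / num_input_points

# Toy usage: a decoded point displaced along, or across, the surface normal.
ref = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
ref_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
off_surface = np.array([[0.0, 0.0, 1.0]])   # moved along the normal
on_surface = np.array([[1.0, 0.0, 0.0]])    # moved within the tangent plane
```

In this toy case D1 penalizes both displacements equally, whereas D2 ignores the displacement that stays within the tangent plane, which is why the two measures can rank reconstructions differently.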
Our method offers average gains of 88% and 82% against PCL, 77% and 69% against GPCC (octree), and 67% and 62% against GPCC (trisoup), measured by D1-based and D2-based BDRate, respectively. Illustrative rate-distortion curves are presented in Figs. 6, 7, and 8.
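For readers unfamiliar with the metric, the sketch below follows the commonly used Bjontegaard procedure: the logarithm of the bit rate is fitted as a cubic polynomial of quality for each codec, both fits are integrated over the shared quality range, and the mean difference is converted to a percentage. The function name `bd_rate`, the use of PSNR as the quality axis and the sample rate points are assumptions for illustration; this is not the evaluation script used for Table II.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bit rate difference (%) of the test codec against the anchor
    at equal quality; negative values mean the test codec saves bits."""
    log_ra = np.log10(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log10(np.asarray(rate_test, dtype=float))
    # Cubic fit of log-rate as a function of quality for each curve.
    poly_a = np.polyfit(psnr_anchor, log_ra, 3)
    poly_t = np.polyfit(psnr_test, log_rt, 3)
    # Integrate both fits over the overlapping quality interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyint(poly_a)
    int_t = np.polyint(poly_t)
    area_a = np.polyval(int_a, hi) - np.polyval(int_a, lo)
    area_t = np.polyval(int_t, hi) - np.polyval(int_t, lo)
    avg_log_diff = (area_t - area_a) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0

# Toy usage with made-up rate points: the test codec needs half the bits
# of the anchor at every quality level.
psnr = [60.0, 64.0, 68.0, 72.0]
anchor_bpp = [0.1, 0.2, 0.4, 0.8]
test_bpp = [0.05, 0.1, 0.2, 0.4]
saving = bd_rate(anchor_bpp, psnr, test_bpp, psnr)
```

With these made-up points the result is a BD-Rate of about -50%, i.e., the test codec spends half the bits of the anchor on average.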
As reported above, our LearnedPCGC exceeds the current GPCC and PCL based geometry compression by a significant margin. The compression efficiency of our method holds consistently across all testing classes, e.g., dense or sparse voxel distributions, complete or incomplete surfaces, etc. Note that our training dataset consists of watertight point clouds generated from ShapeNet [16], which do not directly cover the sparse voxel distribution of Class C. Nevertheless, our model still works after applying a simple scaling. These observations support the generalization of our method to various application scenarios.
In addition to the comparisons with 3D model based geometry compression (e.g., PCL and GPCC in Table II), we also extend the discussion to the projection-based approach, e.g., VPCC.
We use the latest TM2v6.0 [43] to demonstrate the efficiency of the standard-compliant VPCC solution. For a fair comparison, we set the mode to “All-intra (AI)” and compress only a single frame of the dynamic point cloud, following the same test condition as above [39]. We use a range of quantization parameters (QP = 32, 28, 24, 20, and 16) to obtain sufficient bit rates for the coded geometry. Bit rates for geometry components (e.g., metadata, occupancy map, depth map [43]) are separated from those of the attributes for performance validation.
Our LearnedPCGC achieves performance comparable to VPCC based geometry compression, as shown in Fig. 9. BDRate results are listed in Table III: on average, our method shows an 8.16% D1 BDRate loss but a 4.31% D2 BDRate gain. VPCC performs better on Loot and Longdress, while our LearnedPCGC works better on Redandblack and Soldier. Our analysis suggests that the more occluded regions a point cloud has, the better the compression efficiency of our LearnedPCGC relative to VPCC. This is because our method inherently captures the voxel distribution in 3D space, regardless of the occlusion or shape incompleteness that cannot be well exploited by the projection-based method.
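Table III: BD-Rate (%) of our method against VPCC. Positive values indicate a loss and negative values a gain.

| Point Cloud | D1 (p2point) | D2 (p2plane) |
|---|---|---|
| Loot | +21.00 | +8.59 |
| Redandblack | -8.99 | -21.87 |
| Soldier | +3.84 | -7.47 |
| Longdress | +16.81 | +3.51 |
| Average | +8.16 | -4.31 |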
Subjective Evaluation. We show the decoded point clouds from different methods and the ground truth in Figs. 10, 11, and 12; we recommend zooming in to see the details. To visualize the point clouds, we first compute the normal of each point from its neighboring points, then set parallel lighting from the front and render the points with Lambertian shading. In this way, the detailed geometry can be observed more intuitively than in vertex-color rendered images. We also plot error maps based on the point-to-point (D1) distance between the decoded point clouds and the ground truth to visualize the error distribution. Our method preserves the detailed geometry and generates visually high-quality point clouds. Though VPCC performs well in the objective comparison, its reconstructed point clouds contain obvious seams, as marked by the yellow dotted boxes in the zoomed-in areas of Figs. 10 and 11. This is because VPCC encodes point clouds by projecting them onto different views, so seams are difficult to avoid when the projected point clouds are fused in the decoding phase. We also find that the GPCC (trisoup) codec may lose geometry details (e.g., the visible holes shown in Figs. 10 and 11). The point clouds reconstructed by the GPCC (octree) codec and PCL are much sparser, as they can retain far fewer points at a comparable bit rate budget.
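The visualization steps described above (normals from neighboring points, then shading under a parallel frontal light) can be sketched as below. This is a minimal brute-force illustration under our own assumptions: normals are taken as the direction of least variance among the k nearest neighbors, the neighborhood size `k` and the light direction are arbitrary choices, and the absolute value of the cosine is used because such normals have no consistent orientation. It is not the rendering code used for Figs. 10 to 12.

```python
import numpy as np

def estimate_normals(points, k=8):
    """Per-point normal: least-variance direction of the k nearest neighbors
    (the point itself is counted among them)."""
    points = np.asarray(points, dtype=float)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    knn = np.argsort(d2, axis=1)[:, :k]
    normals = np.empty_like(points)
    for i, idx in enumerate(knn):
        centered = points[idx] - points[idx].mean(axis=0)
        # eigh returns eigenvalues in ascending order, so column 0 is the
        # eigenvector of the smallest eigenvalue of the local covariance.
        _, vecs = np.linalg.eigh(centered.T @ centered)
        normals[i] = vecs[:, 0]
    return normals

def lambert_intensity(normals, light_dir=(0.0, 0.0, 1.0)):
    """Diffuse intensity in [0, 1] under a parallel light along light_dir."""
    light = np.asarray(light_dir, dtype=float)
    light = light / np.linalg.norm(light)
    return np.abs(normals @ light)

# Toy usage: a flat 5x5 patch facing the light is rendered at full intensity.
xs, ys = np.meshgrid(np.arange(5.0), np.arange(5.0))
patch = np.stack([xs.ravel(), ys.ravel(), np.zeros(25)], axis=1)
shade = lambert_intensity(estimate_normals(patch))
```

For full point clouds a spatial index (e.g., a k-d tree) would replace the quadratic distance matrix used here.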
An interesting observation is that our reconstructed point cloud fills some broken parts of the ground truth point cloud. The broken parts are caused by incomplete or failed scans. We highlight the repaired part using the blue dotted box in Fig. 11. We believe this is because we use ShapeNet [16] to generate the point cloud samples for training, most of which are fine mesh models designed with CAD software. With such high-quality training data, the distortion of our reconstruction tends toward complete and smooth shapes with lower noise. In contrast, the distortion of other methods tends toward random noise.
We further extend our studies by examining various aspects of our LearnedPCGC, including the partition size, hyperpriors, adaptive thresholding, and convolution kernels, to demonstrate the robust and reliable performance of our method.
Partition Size. Analogous to the size of the coding tree unit in HEVC, we can set different partition sizes to explore their impact on coding efficiency and implementation complexity in practice.
In the subsequent discussion, we exemplify our studies using Loot at three different partition sizes, i.e., 32, 64 and 128. Other testing materials show similar outcomes. As illustrated in Fig. 13, BDRate gains about 20% from to , but almost keeps the same from to .
In addition to the BDRate, we also report other factors in Table IV, e.g., the total number of cubes (cube#), metadata overhead (meta_bits), and the time (in seconds) and memory consumption (mem) when executing the simulation. The time and memory consumption given here are for processing each cube, and are tested on a computer with an Intel i7-8700 CPU and a GTX1070 GPU (with 8 GB memory). All of these factors have a substantial impact on the implementation complexity. For example, the smaller the partition size, the better the parallel processing, with less memory consumption and computational time. However, this comes with more blocky artifacts and a BDRate sacrifice. Thus, a good choice of partition size needs to balance BDRate performance and implementation complexity. In this work, we choose .


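Table IV: Impact of the partition size (Loot).

| Partition size | cube# | meta_bits (bpp) | time | mem |
|---|---|---|---|---|
| 128 | 51 | 0.0015 | 0.78 s | 2208 MB |
| 64 | 212 | 0.0046 | 0.13 s | 414 MB |
| 32 | 790 | 0.0142 | 0.06 s | 252 MB |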
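The relation between partition size and cube count discussed above can be illustrated with a short sketch that groups voxel coordinates into non-overlapping cubes and keeps only the non-empty ones. The function name `partition_into_cubes` and the toy coordinates are hypothetical; the sketch only illustrates the partitioning idea and is not our implementation.

```python
import numpy as np

def partition_into_cubes(voxels, cube_size):
    """Group integer voxel coordinates into non-overlapping cubes.

    Returns a dict that maps the index (i, j, k) of every non-empty cube to
    the local coordinates, in [0, cube_size), of the voxels it contains.
    """
    voxels = np.asarray(voxels, dtype=np.int64)
    cube_idx = voxels // cube_size     # which cube each voxel falls into
    local = voxels % cube_size         # position of the voxel inside its cube
    keys, inverse = np.unique(cube_idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)      # flat labels across NumPy versions
    return {tuple(int(c) for c in key): local[inverse == n]
            for n, key in enumerate(keys)}

# Toy usage: the same four voxels occupy more cubes as the cube size shrinks.
toy = [[0, 0, 0], [63, 63, 63], [64, 0, 0], [130, 5, 70]]
cubes_128 = partition_into_cubes(toy, 128)
cubes_64 = partition_into_cubes(toy, 64)
```

Each non-empty cube has to be located by the decoder, which is consistent with the growth of meta_bits as the number of cubes increases in Table IV.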
Hyperpriors. Hyperpriors have been used for accurate conditional entropy modeling in image compression [10, 14]. Here we further examine their efficiency in our LearnedPCGC.
Compared with the scenario using only a factorized entropy model for the latent representations, hyperpriors improve the context modeling through conditional probability exploration and lead to better rate-distortion performance, yielding about 14.65% BDRate gains in our experiments.
Adaptive Thresholding. A thresholding mechanism is applied to classify each decoded voxel into a binary decision (e.g., 1 or 0) on its occupancy state. We aim to find a threshold that leads to the minimum distortion (e.g., D1 or D2) of the reconstruction.
A straightforward way is to set a naïve value, such as 0.5, as the global threshold for classifying the voxels of all cubes. However, performance suffers. Instead, we propose to order the decoded voxels of each cube and select the top ones, e.g., as many as num_occupied_voxel, which serves as an adaptive threshold and leads to noticeable BDRate gains in Fig. 14.
Since we are optimizing the top selection to minimize D1 or D2 for classification, we further study whether adapting can bring more gains by fine-tuning. As also illustrated in Fig. 14, adjusting for a fine-tuned , i.e.,
(14) 
would yield BDRate improvement. For example, on average, BDRate gains about 8.9% with optimal in (14), or equivalent , when minimizing D1 distortion, and about 6.7% when minimizing D2 distortion. The optimal differs between the D1 and D2 measures because of their fundamentally different distance calculations, as visualized in Fig. 15 for “Loot” at 0.11 bpp. More voxels are selected for optimal D1 distortion, e.g., =1.14, while fewer voxels, e.g., = 0.91, are used for better reconstruction under D2 distortion. This indicates that the D2 measurement is more suitable for sparser point clouds.
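The per-cube selection described above can be sketched as follows. The decoded occupancy probabilities of one cube are ranked and only the highest-scoring voxels are kept. The function name `adaptive_threshold` and the ratio name `rho` are our own hypothetical labels (the symbols of the original formulation are not reproduced here), and the number of occupied voxels is assumed to be available to the decoder. With `rho = 1.0` the number of kept voxels equals num_occupied_voxel; values such as 1.14 or 0.91 keep proportionally more or fewer voxels.

```python
import numpy as np

def adaptive_threshold(probs, num_occupied, rho=1.0):
    """Mark the k most probable voxels of one cube as occupied,
    with k = round(rho * num_occupied), clipped to the cube size."""
    flat = probs.reshape(-1)
    k = int(round(rho * num_occupied))
    k = max(0, min(k, flat.size))
    occupancy = np.zeros(flat.shape, dtype=bool)
    if k > 0:
        # After partitioning, the last k positions hold the k largest values.
        top = np.argpartition(flat, flat.size - k)[flat.size - k:]
        occupancy[top] = True
    return occupancy.reshape(probs.shape)

# Toy usage: a fixed 0.5 threshold keeps two voxels, although three were
# occupied in the input; the adaptive rule keeps the three most probable.
probs = np.array([[0.9, 0.4], [0.6, 0.1]])
fixed = probs > 0.5
adaptive = adaptive_threshold(probs, num_occupied=3)

rng = np.random.default_rng(0)
cube = rng.random((10, 10, 10))
more = adaptive_threshold(cube, num_occupied=100, rho=1.14)
fewer = adaptive_threshold(cube, num_occupied=100, rho=0.91)
```

Ranking avoids committing to a single global probability cut-off, so the effective threshold adapts to how confident the decoder is within each cube.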


Convolution Kernels. Our LearnedPCGC, including both the main and hyper encoder-decoder pairs, requires 658,092 parameters in total for all embedded convolutions. In the current implementation, each parameter is stored in a 4-byte floating-point format. This amounts to about 2.52MB of storage, a fairly small on-chip buffer requirement compared with other popular networks, such as AlexNet [44] with 60 million parameters, or GoogleNet [45] with 4 million parameters.
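The storage figure can be checked with a few lines of arithmetic. The snippet below only restates the numbers given above; the variable names are ours.

```python
num_params = 658_092          # parameters of all convolutions
bytes_per_param = 4           # 4-byte floating-point storage

total_bytes = num_params * bytes_per_param
size_mib = total_bytes / 2**20           # size in units of 2^20 bytes
alexnet_ratio = 60_000_000 / num_params  # AlexNet parameters per LearnedPCGC parameter

print(total_bytes)              # 2632368
print(round(size_mib, 2))       # 2.51
print(round(alexnet_ratio))     # 91
```

The result is roughly 2.5 MB, in line with the figure quoted above, and about two orders of magnitude fewer parameters than AlexNet.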
Our experiments have also revealed that the current stacked VRN with small convolutional kernels and deep layers offers much better performance than an alternative shallow network with larger convolutional kernels. This is mainly because larger convolutions cannot efficiently capture spatial information, given the sparse spatial distribution of voxels in 3D space. In contrast, deeper layers (with downscaling) offer an effective way to exploit correlation at a variety of scales.
A learning-based point cloud geometry compression method, so-called LearnedPCGC, is presented in this work. It consists of stacked 3D convolutions for the extraction of latent features and hyperpriors, a VAE structure for accurate entropy modeling of latent features, a weighted BCE loss in training, and an adaptive thresholding scheme in inference for correct voxel classification.
We have demonstrated the state-of-the-art efficiency of the proposed LearnedPCGC for point cloud geometry compression, both objectively and subjectively, in comparison to existing standardized methods, for example, over 62% and 67% BDRate gains over GPCC (trisoup), and over 69% and 76% BDRate gains over GPCC (octree), when the distortion is measured using D2 and D1 criteria, respectively. Our method also provides compression efficiency comparable to the projection-based MPEG VPCC. Subjective evaluations have also evidenced the superior performance of our method with noticeable perceptual improvements. Additional ablation studies examine a variety of aspects of our method by carefully analyzing its performance and efficiency.
As for future studies, there are several interesting avenues to explore. For example, the recent PointConv [46] might be borrowed to improve the efficiency of convolutions for point clouds. In addition, traditional distortion measurements, such as D1 and D2, still suffer from a low correlation with subjective assessment. A better objective metric is highly desired.
We are deeply grateful for the constructive comments from anonymous reviewers to improve this paper.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5868–5877.
C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.