1 Introduction
In recent years, volumetric implicit representations have been at the heart of many 3D and 4D reconstruction approaches [kinfu, dynfu, relightables, m2f], enabling novel applications such as real-time dense surface mapping in AR devices and free-viewpoint videos. While these representations exhibit numerous advantages, transmitting high quality 4D sequences is still a challenge due to their large memory footprints. Designing efficient compression algorithms for implicit representations is therefore of prime importance to enable the deployment of novel consumer-level applications such as VR/AR telepresence [holo], and to facilitate the streaming of free-viewpoint videos [fvvbook].
In contrast to compressing a mesh, it was recently shown that truncated signed distance fields (TSDF) [CurlessL:96] are highly suitable for efficient compression [compression, KrivokucaKC:18] due to correlation in voxel values and their regular grid structure. Voxel-based SDF representations have been used with great success for 3D shape learning using encoder-decoder architectures [conf/cvpr/SitzmannTHNWZ19, Wang:2017:OOC:3072959.3073608]. This is in part due to their grid structure, which can be naturally processed with 3D convolutions, allowing the use of convolutional neural networks (CNN) that have excelled in image processing tasks. Based on these observations, we propose a novel block-based encoder-decoder neural architecture trained end-to-end, achieving bitrates that are a fraction of those of prior art [compression]. We compress and transmit the TSDF signs losslessly; this guarantees not only that the reconstruction error is upper bounded by the voxel size, but also that the reconstructed surface is homeomorphic – even when lossy TSDF compression is used. Furthermore, we propose using the conditional distribution of the signs given the encoded TSDF block to compress the signs losslessly, leading to significant gains in bitrates. This also significantly reduces artifacts in the reconstructed geometry and textures.
Recent 3D and 4D reconstruction pipelines not only reconstruct accurate geometry, but also generate high quality texture maps that need to be compressed and transmitted together with the geometry [relightables]. To complement our TSDF compression algorithm, we developed a fast parametrization method based on block-based charting, which encourages spatiotemporal coherence without tracking. Our approach allows efficient compression of textures using existing image-based techniques and removes the need for compressing and streaming UV coordinates.
To summarize, we propose a novel block-based 3D compression model with these contributions:


the first deep 3D compression method that can train end-to-end with entropy encoding, yielding state-of-the-art performance;

lossless compression of the surface topology using the conditional distribution of the TSDF signs, thereby bounding the reconstruction error by the size of a voxel;

a novel block-based texture parametrization that inherently encourages temporal consistency, without tracking or the need to compress UV coordinates.
2 Related works
Compression of 3D/4D media (e.g., meshes, point clouds, volumes) is a fundamental problem for applications such as VR/AR, yet has received limited attention in the computer vision community. In this section, we describe two main aspects of 3D compression, geometry and texture, and review recent trends in learnable compression.
Geometry compression
Geometric surface representations can either be explicit or implicit. While explicit representations are dominant in traditional computer graphics [pmp, fvv], implicit representations have found widespread use in perception-related tasks such as real-time volumetric capture [motion2fusion, kinfu, dynfu, dou2016fusion4d]. Explicit representations include meshes, point clouds, and parametric surfaces (NURBS). We refer the reader to the relevant surveys [AlliezG05, Peng_2005, maglo20153d] for compression of such representations. Mesh compressors such as Draco [draco] use connectivity compression [Rossignac_1999, MamouZP09] followed by vertex prediction [gotsmantoumagi98]. An alternate strategy is to encode the mesh as geometry images [GuGH02], or geometry videos [BricenoSMGH03] for temporally consistent meshes. Point clouds have been compressed by Sparse Voxel Octrees (SVOs) [JackinsT80, meagher_octtree], first used for point cloud geometry compression in [schnabel_2006]. SVOs have been extended to coding dynamic point clouds in [Kammerl2012] and implemented in the Point Cloud Library (PCL) [Rusu_3dis]. A version of this library became the anchor (i.e., reference proposal) for the MPEG Point Cloud Codec (PCC) [MekuriaBC:17]. The MPEG PCC standard is split into video-based PCC (V-PCC) and geometry-based PCC (G-PCC) [schwarz2018emerging]. V-PCC uses geometry video, while G-PCC uses SVOs. Implicit representations include (truncated) signed distance fields (SDFs) [CurlessL:96] and occupancy/indicator functions [poisson]. These have proved popular for 3D surface reconstruction [CurlessL:96, dynfu, DouTFFI15, dou2016fusion4d, m2f, compression, LoopCOC:16] and general 2D and 3D representation [Frisken:2000:ASD:344779.344899]. Implicit functions have recently been employed for geometry compression [KrivokucaCK:18, canhelas, compression], where the TSDF is encoded directly.
Texture compression
In computer graphics, textures are images associated with meshes through UV maps. These images can be encoded using standard image or video codecs [draco]. For point clouds, color is associated with points as attributes. Point cloud attributes can be coded via spectral methods [zhang_icip_2014, ThanouCF16, CohenTV16, QueirozC:17] or transform methods [QueirozC:16]. Transform methods are used in MPEG G-PCC [schwarz2018emerging], and, similarly to TSDFs, have a volumetric interpretation [ChouKK:18]. Another approach is to transmit the texture as ordinary video from each camera, and use projective texturing at the receiver [compression]. However, the bitrate increases linearly with the number of cameras, and projective texturing can create artifacts when the underlying geometry is compressed. Employing a UV parametrization to store textures is not trivial, as enforcing spatial and temporal consistency can be computationally intensive. On one end of the spectrum, Motion2Fusion [m2f] gives up the spatial coherence typically desired by simply mapping each triangle to an arbitrary position of the atlas, hence sacrificing compression rate for efficiency. On the other extreme, [prada17, relightables] go a step further by tracking features over time to generate a temporally consistent mesh connectivity and UV parametrization, which can therefore be compressed with modern video codecs. This process is however expensive and cannot be applied to real-time applications.
Learnable compression strategies
Learnable compression strategies have a long history. Here we focus specifically on neural compression. The use of neural networks for image compression can be traced back to the 1980s with autoencoder models using uniform [munro1989image] or vector [luttrell1988image] quantization. However, these approaches were akin to non-linear dimensionality reduction methods and do not learn an entropy model explicitly. More recently, [toderici15rnn] used a recurrent LSTM-based architecture to train multi-rate progressive coding models. However, they learned an explicit entropy model as a separate post-processing step after the training of the recurrent autoencoding model. [balle2017end] proposed an end-to-end optimized image compression model that jointly optimizes the rate-distortion trade-off. This was extended by placing a hierarchical hyperprior on the latent representations to significantly improve the image compression performance [hyperprior]. While there has been significant application of deep learning on 3D/4D representations, e.g. [conf/cvpr/WuSKYZTX15, qi2016pointnet, Wang:2017:OOC:3072959.3073608, conf/cvpr/LiaoDG18, conf/cvpr/SitzmannTHNWZ19, conf/cvpr/ParkFSNL19], application of deep learning to 3D/4D compression has been scant. However, very recent works closely related to ours have used rate-distortion optimized autoencoders similar to [hyperprior] to perform 3D geometry compression end-to-end: [yan2019deep] used a PointNet-like encoder combined with a fully-connected decoder, trained to directly minimize the Chamfer distance subject to a rate constraint, on the entire point cloud. [quach2019learning] performs block-based coding to obtain higher quality on the MVUB dataset [LoopQOC:17]. Their network predicts voxel occupancy using a focal loss, which is similar to a weighted binary cross entropy. In the most complete and performant work until now, [wang2019learned] also uses block-based coding and predicted voxel occupancy, with a weighted binary cross entropy. They reported a 60% bitrate reduction compared to MPEG G-PCC on the high resolution 8iVFB dataset [dEonHMC:17] hosted by MPEG, though they report only approximate equivalence with state-of-the-art MPEG V-PCC.
In contrast, we use block-based coding on even higher resolution datasets, and report bitrates that are at least three times better than MPEG V-PCC, by compressing the TSDF directly rather than occupancy, yielding sub-voxel precision.
3 Background
Our goal is to compress an input sequence of TSDF volumes encoding the geometry of the surface, and their corresponding texture atlases, which are both extracted from a multiview RGBD sequence [compression, relightables]. Since geometry and texture are quite different, we compress them separately. The two data streams are then fused by the receiver before rendering. To compress the geometry data, inspired by the recent advances in learned compression methods, we propose an end-to-end trained compression pipeline taking volumetric blocks as input; see Section 4. Accordingly, we also design a block-based UV parametrization algorithm for texture; see Section 5. For those unfamiliar with the topic and notation, we overview fundamentals of compression in the supplementary material.
4 Geometry compression
There are two primary challenges in end-to-end learning of compression, both of which arise from the non-differentiability of intermediate steps: (1) compression is non-differentiable due to the quantization necessary for compression; (2) surface reconstruction from TSDF values is typically non-differentiable in popular methods such as Marching Cubes [marchingcubes]. To tackle (1), we draw inspiration from the recent advances in learned image compression [hyperprior, balle2017end]. To tackle (2), we make the observation that the Marching Cubes algorithm is differentiable when the topology is known.
Computational feasibility of training
The dense TSDF volume data for an entire sequence is very high dimensional. For example, a sequence from the dataset used in [compression] has 500 frames, each of which is a dense voxel volume. The high dimensionality of the data makes it computationally infeasible to compress the entire sequence jointly. Therefore, following [compression], we process each frame independently in a block-based manner. From the TSDF volume, we extract all non-overlapping fixed-size blocks that contain a zero crossing. We refer to these blocks as occupied blocks, and compress them independently.
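The block extraction described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `occupied_blocks`, the block size of 8, the padding value, and the test "the block holds both negative and non-negative values" as the definition of a zero crossing are all choices made here for illustration.

```python
import numpy as np

def occupied_blocks(tsdf, block=8):
    """Return {block_index: block_values} for non-overlapping blocks with a sign change.

    Assumptions (not from the paper): the volume is padded so each axis is a
    multiple of `block`, and a block is "occupied" when it contains both
    negative and non-negative TSDF values.
    """
    tsdf = np.asarray(tsdf, dtype=np.float32)
    # Pad with a positive ("outside") value so every axis is divisible by `block`.
    pad = [(0, (-n) % block) for n in tsdf.shape]
    tsdf = np.pad(tsdf, pad, mode="constant", constant_values=1.0)
    nx, ny, nz = (n // block for n in tsdf.shape)
    out = {}
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                b = tsdf[i * block:(i + 1) * block,
                         j * block:(j + 1) * block,
                         k * block:(k + 1) * block]
                if b.min() < 0.0 <= b.max():
                    out[(i, j, k)] = b
    return out

# Toy volume: a sphere of radius 5 voxels centred in a 16^3 grid, truncated to [-1, 1].
g = np.indices((16, 16, 16)).astype(np.float32) - 7.5
volume = np.clip(np.sqrt((g ** 2).sum(axis=0)) - 5.0, -1.0, 1.0)
blocks = occupied_blocks(volume, block=8)
```

Only the occupied blocks would then be passed to the encoder; blocks that are entirely inside or entirely outside the surface are skipped.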
4.1 Inference
The compression pipeline is illustrated in Figure 2. Given a block to be transmitted, the sender first computes the lossily quantized latent representation using the learned encoder. Next, the sender uses the quantized latent to compute the conditional probability distribution over the ground truth sign configuration of the block; this distribution has its own learnable parameters. The sender then uses an entropy coder to compute two bitstreams by losslessly coding the latent code and the signs, using a learned prior distribution over the latent code and the conditional sign distribution respectively. Note that while the prior distribution is part of the model and known a priori both to the sender and the receiver, the conditional distribution needs to be computed by both. The two bitstreams are then transmitted to the receiver, which first recovers the latent code using entropy decoding with the shared prior. The receiver then recomputes the conditional distribution in order to recover the losslessly coded ground truth signs. Finally, the receiver recovers the lossy TSDF values by using the learned decoder in conjunction with the ground truth signs: the element-wise absolute value of the decoder output is multiplied element-wise by the signs.
To stitch the volume together, the block indices are transmitted to the client as well. Similar to [compression], the blocks are sorted in ascending order, and delta encoding is used to convert the vector of indices to a representation that is entropy encoder friendly. Once the TSDF volume is reconstructed, a triangular mesh can be extracted via Marching Cubes. Note that for the Marching Cubes algorithm, the polygon configurations are fully determined by the signs. As we transmit the signs losslessly, it is guaranteed that the mesh extracted from the decoded TSDF will have the same topology as the mesh extracted from the uncompressed TSDF. It follows that the only possible reconstruction errors will be at the vertices that lie on the edges of the voxels. Therefore, the maximum reconstruction error is bounded by the edge length, i.e. the voxel size, as shown in Figure 3.
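Two small steps of the receiver side can be illustrated in a few lines. The sketch below assumes that block indices are linearized to integers and that signs are stored as +1 / -1; the function names are hypothetical and the entropy coder itself is omitted.

```python
def delta_encode(sorted_indices):
    """Keep the first index, then store successive differences."""
    out, prev = [], 0
    for idx in sorted_indices:
        out.append(idx - prev)
        prev = idx
    return out

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

def apply_signs(signs, decoded):
    """Combine lossless signs (+1 / -1) with the magnitude of the lossy decoder output."""
    return [s * abs(v) for s, v in zip(signs, decoded)]

# Hypothetical linearized indices of occupied blocks.
block_ids = sorted([1045, 17, 18, 19, 402, 403])
deltas = delta_encode(block_ids)  # small values dominate, which suits an entropy coder

# The decoder may get a sign wrong (last value), but the transmitted signs override it.
tsdf_hat = apply_signs([1, -1, 1], [0.3, 0.2, -0.1])
```

Because neighbouring occupied blocks tend to have consecutive indices, most deltas are small, and the sign override is what keeps the extracted topology identical to that of the uncompressed volume.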
4.2 Training
We learn the parameters of our compression model by minimizing the following objective
(1) 
Distortion
We minimize the reconstruction error between the ground truth and the predicted TSDF values. However, directly computing the squared difference wastes model complexity on learning to precisely reconstruct values of TSDF voxels that are far away from the surface. In order to focus the network on the important voxels (i.e. the ones with a neighboring voxel of opposing sign), we use the ground truth signs. For each of the three dimensions, we create a mask of important voxels. Voxels that have more than one neighbor with opposite signs appear in multiple masks, further increasing their weights. We then use these masks to calculate the squared differences for important voxels only.
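A possible reading of this masking scheme is sketched below with NumPy. It is an illustration only: marking both voxels of a sign-changing pair, summing the per-axis masks into a weight, and using an unnormalized sum of squared differences are assumptions, as are the function names.

```python
import numpy as np

def sign_change_masks(signs):
    """One boolean mask per axis.

    A voxel is marked along an axis when it and its neighbour along that axis
    have opposite signs (both voxels of such a pair are marked).
    """
    masks = []
    for axis in range(signs.ndim):
        a = np.moveaxis(signs, axis, 0)
        change = a[:-1] != a[1:]
        m = np.zeros(a.shape, dtype=bool)
        m[:-1] |= change
        m[1:] |= change
        masks.append(np.moveaxis(m, 0, axis))
    return masks

def masked_distortion(x, x_hat, signs):
    """Squared error accumulated over the per-axis masks (unnormalized)."""
    weight = sum(m.astype(np.float32) for m in sign_change_masks(signs))
    return float((weight * (x - x_hat) ** 2).sum())

# Toy 4x1x1 block with a single zero crossing between the two middle voxels.
x = np.array([-1.0, -0.5, 0.5, 1.0]).reshape(4, 1, 1)
x_hat = x + np.array([1.0, 0.1, 0.2, 1.0]).reshape(4, 1, 1)
d = masked_distortion(x, x_hat, np.sign(x))
```

In this toy case the large errors on the two outer voxels do not contribute, since neither of them has a neighbour of opposite sign.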
Rate of latents
A second loss term we employ is the rate of the latents, which is designed to reduce the bitrate of the compressed codes. This loss is essentially a differentiable estimate of the non-differentiable Shannon entropy of the quantized codes; see [balle2017end] for additional details.
Rate of losslessly compressed signs
Since the sign field contains only discrete values, it can be compressed losslessly using entropy coding. As mentioned above, we use the conditional probability distribution of the signs given the quantized latent instead of a prior distribution over the signs. Note that the conditional distribution should have a much lower entropy than the prior, since the two are dependent by design. This allows us to compress the signs far more efficiently.
To make this dependency explicit, we add an extra head to the decoder that predicts the sign distribution alongside the TSDF values. The sign rate loss is then the cross entropy between the ground truth signs, remapped to binary labels, and their conditional predictions. Minimizing this loss has the effect of training the network to make better sign predictions, while also minimizing the bitrate of the compressed signs.
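The link between the cross entropy and the bitrate can be made concrete with a small sketch. It assumes signs stored as +1 / -1 and remapped to the labels 1 / 0, and it uses made-up per-voxel predictions in place of the output of the decoder's sign head; `sign_rate_bits` is a hypothetical name.

```python
import math

def sign_rate_bits(signs, p_positive):
    """Cross entropy, in bits, between signs (+1 / -1) and predicted P(sign = +1)."""
    bits = 0.0
    for s, p in zip(signs, p_positive):
        target = 1 if s > 0 else 0           # remap {-1, +1} to {0, 1}
        q = p if target == 1 else 1.0 - p    # probability assigned to the true label
        bits += -math.log2(q)
    return bits

signs = [+1, +1, -1, -1, +1, -1, +1, +1]

# Data-independent model: one global probability of a positive sign.
prior = [sum(s > 0 for s in signs) / len(signs)] * len(signs)

# Data-dependent model: illustrative per-voxel predictions (not real network output).
conditional = [0.95, 0.9, 0.1, 0.05, 0.9, 0.1, 0.95, 0.9]

bits_prior = sign_rate_bits(signs, prior)
bits_conditional = sign_rate_bits(signs, conditional)
```

An ideal entropy coder driven by the same probabilities spends approximately this many bits, so confident and correct sign predictions translate directly into a shorter bitstream.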
Encoder and Decoder architectures
Our proposed compression technique is agnostic to the choice of the individual architectures for the encoder and decoder. In this work, we targeted a scenario requiring a maximum model size of roughly 2MB, which makes the network suitable for mobile deployment. To limit the number of trainable parameters, we used convolutional networks, where both the encoder and the decoder consist of a series of 3D convolutions and transposed convolutions. More details about the specific architectures can be found in the supplementary material.
5 Texture compression
We propose a novel, efficient and tracking-free UV parametrization method to be seamlessly combined with our block-level geometry compression; see Figure 2. As our parametrization process is deterministic, UV coordinates can be inferred on the receiver side, thus removing the need for compression and transmission of the UV coordinates.
Blocklevel charting
Traditional UV mapping either partitions the surface into a few large charts [zhou2004iso], or generates one chart per triangle to avoid UV parametrization as in PTEX [burley2008ptex]. In our case, since the volume has already been divided into fixed-size blocks during geometry compression, it is natural to explore block-level parametrization. To accommodate compression error, the compressed signal is decompressed on the sender side, such that both the sender and receiver have access to identical reconstructed volumes; see Figure 2 (left). Triangles of each occupied block are then extracted and grouped by their normals. Most blocks have only one group, while blocks in more complex areas (e.g. fingers) may have more. The vertices of the triangles in each group are then mapped to UV space as follows: (1) the average normal in the group is used to determine a tangent space, onto which the vertices in the group are projected; (2) the projections are rotated until they fit into an axis-aligned rectangle with minimum area, using rotating calipers [Toussaint83solvinggeometric]. This results in deterministic UV coordinates for each vertex in the group relative to a bounding box for the vertex projections; (3) the bounding boxes for the groups in a block are then sorted by size and packed into a chart using a quadtree-like algorithm. There is exactly one 2D chart for each occupied 3D block. After this packing, the UV coordinates for the vertices in the block are offset to be relative to the chart. These charts are then packed into an atlas, where the UV coordinates for the vertices are again offset to be relative to the atlas, i.e. to be a global UV mapping. After UV parametrization, color information can be obtained from per-vertex color in the geometry, a previously generated atlas, or even raw RGB captures. Our method is agnostic to this process.
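Steps (1) and (2) can be sketched in plain Python as follows. Note the substitution: instead of rotating calipers, this sketch uses a brute-force sweep over rotation angles, which is simpler but only approximates the minimum-area rectangle up to the angular step. All function names and the choice of helper axis for the tangent basis are assumptions.

```python
import math

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tangent_basis(normal):
    """Two orthonormal vectors spanning the plane perpendicular to `normal`."""
    length = math.sqrt(dot(normal, normal))
    n = tuple(c / length for c in normal)
    # Use the x axis as helper unless the normal is nearly parallel to it.
    helper = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = cross(helper, n)
    ul = math.sqrt(dot(u, u))
    u = tuple(c / ul for c in u)
    return u, cross(n, u)

def project(vertices, normal):
    """Step (1): project 3D vertices onto the tangent plane of the average normal."""
    u, v = tangent_basis(normal)
    return [(dot(p, u), dot(p, v)) for p in vertices]

def min_area_box(points2d, steps=180):
    """Step (2), brute force: try rotations in [0, 90) degrees, keep the smallest box."""
    best = None
    for k in range(steps):
        t = (math.pi / 2) * k / steps
        c, s = math.cos(t), math.sin(t)
        rot = [(c * x + s * y, -s * x + c * y) for x, y in points2d]
        xs, ys = [p[0] for p in rot], [p[1] for p in rot]
        w, h = max(xs) - min(xs), max(ys) - min(ys)
        if best is None or w * h < best[0]:
            # UV coordinates are expressed relative to the corner of the box.
            best = (w * h, w, h, [(x - min(xs), y - min(ys)) for x, y in rot])
    return best

# A 2 x 1 rectangle in the z = 0 plane, rotated by 30 degrees.
a = math.radians(30)
quad = [(x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a), 0.0)
        for x, y in [(0, 0), (2, 0), (2, 1), (0, 1)]]
area, w, h, uvs = min_area_box(project(quad, (0.0, 0.0, 1.0)))
```

Since every operation is deterministic, running the same procedure on the identical reconstructed volume at the receiver would reproduce the same UV coordinates, which is the property the text relies on.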
Morton packing
In order to optimize compression, the block-level charts need to be packed into an atlas in a way that maximizes spatiotemporal coherence. This is non-trivial, as in our sparse volume data structure the number and positions of blocks can vary from frame to frame. Assuming the movement of the subject is smooth, preserving the 3D spatial structure among blocks during packing is expected to preserve spatiotemporal coherence. To achieve this effect we propose a Morton packing strategy. Morton ordering [Morton66] (also called Z-order curve) has been widely used in 3D graphics to create spatial representations [lauterbach2009fast]. As our blocks are on a 3D regular grid, each occupied block can be indexed by a triple of integers, each of which has a binary representation. The 3D Morton code of a triple is defined as the integer whose binary representation consists of the interleaved bits of the three integers. Likewise, as our charts are on a 2D regular grid, each chart can be indexed by a pair of integers, whose 2D Morton code is the integer whose binary representation interleaves the bits of the pair. These functions are invertible simply by demultiplexing the bits. We map the chart for an occupied block to the atlas position whose 2D Morton code equals the rank of the block's 3D Morton code in the sorted list of 3D Morton codes, as illustrated in Figure 4 (left). Note that we choose to prioritize the vertical coordinate over the other two when interleaving their bits into the 3D Morton code, to accommodate typically standing human figures. Hence, as long as blocks move smoothly in 3D space, corresponding patches are likely to move smoothly in the atlas, leading to an approximate spatiotemporal coherence, and therefore better (video) texture compression efficacy.
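The packing rule can be written down compactly. The sketch below is a minimal illustration in which the number of bits per coordinate, the choice of y as the vertical axis, and the function names are assumptions.

```python
def interleave(coords, bits=10):
    """Morton code: interleave the bits of the coordinates.

    Within each group of interleaved bits, coords[0] occupies the most
    significant position, so it takes priority in the ordering.
    """
    code = 0
    for b in range(bits - 1, -1, -1):
        for c in coords:
            code = (code << 1) | ((c >> b) & 1)
    return code

def deinterleave(code, dims, bits=10):
    """Inverse of `interleave`: demultiplex the bits back into `dims` coordinates."""
    coords = [0] * dims
    for b in range(bits - 1, -1, -1):
        for d in range(dims):
            shift = b * dims + (dims - 1 - d)
            coords[d] = (coords[d] << 1) | ((code >> shift) & 1)
    return tuple(coords)

def morton_pack(block_positions, bits=10):
    """Map each occupied block (x, y, z) to an atlas cell (u, v).

    Blocks are ranked by their 3D Morton code; the rank is then read as a
    2D Morton code. Placing y first (assumed vertical) gives it priority.
    """
    ordered = sorted(block_positions,
                     key=lambda p: interleave((p[1], p[0], p[2]), bits))
    return {p: deinterleave(rank, 2, bits) for rank, p in enumerate(ordered)}

atlas = morton_pack([(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)])
```

Using the rank rather than the raw 3D code keeps the atlas dense: the occupied blocks fill consecutive cells along the 2D Z-order curve regardless of how sparse the volume is.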
6 Evaluation
To assess our method, we rely on the dataset captured by [compression], which consists of six RGBD multiview sequences of different subjects. We use three of them for training and the others for evaluation. We also employ “The Relightables” dataset by [relightables], which contains higher quality geometry and higher resolution texture maps, with three sequences. To demonstrate the generalization of learning-based methods, we only train on the dataset of [compression], and test on both [compression] and [relightables].
6.1 Geometry compression
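We evaluate geometry compression using two different metrics. The Hausdorff metric ($H$) [metro] measures the worst-case reconstruction error via:
$$H(S, \hat{S}) = \max\left\{ \max_{x \in S} d(x, \hat{S}),\; \max_{y \in \hat{S}} d(y, S) \right\}, \tag{2}$$
where $S$ and $\hat{S}$ are the sets of points on the ground truth and decoded surface respectively, and $d(x, S)$ is the shortest Euclidean distance from a point $x$ to the surface $S$. Another metric is the symmetric Chamfer distance ($C$):
$$C(S, \hat{S}) = \frac{1}{2|S|} \sum_{x \in S} d(x, \hat{S}) + \frac{1}{2|\hat{S}|} \sum_{y \in \hat{S}} d(y, S). \tag{3}$$
For each metric, we compute a final score averaging all volumes, which we refer to as Average Hausdorff Distance and Average Chamfer Distance respectively.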
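For intuition, both metrics can be approximated on sampled point sets. The sketch below replaces the point-to-surface distance by the distance to the nearest sample of the other surface and normalizes the Chamfer distance as the mean of the two directional averages; both choices are assumptions made for illustration, and the quadratic-time search is only suitable for small inputs.

```python
import math

def nearest(p, points):
    """Distance from p to the closest point in `points` (brute force)."""
    return min(math.dist(p, q) for q in points)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the worst-case nearest-neighbour distance."""
    return max(max(nearest(p, b) for p in a), max(nearest(q, a) for q in b))

def chamfer(a, b):
    """Symmetric Chamfer distance: mean of the two average nearest-neighbour distances."""
    forward = sum(nearest(p, b) for p in a) / len(a)
    backward = sum(nearest(q, a) for q in b) / len(b)
    return 0.5 * (forward + backward)

ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
decoded = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
h = hausdorff(ground_truth, decoded)
c = chamfer(ground_truth, decoded)
```

The example shows the difference in character: a single displaced vertex fully determines the Hausdorff value, while the Chamfer value averages it with the unaffected points.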
Method  Raw data  Naïve  Ours
Avg. Size / Volume  155.1KB  139.8KB  2.9KB
Signs
We showcase the benefit of our data-dependent probability model on rate in Table 1. Raw sign data, though binary, has an average size of 155.1KB per volume. With a naïvely computed probability of signs being positive over the dataset, an arithmetic coder can slightly improve the rate to 139.8KB. This is because there are more positive TSDF values than negative ones in the dataset. With our learned, data-dependent probability model, the arithmetic coder can drastically compress the signs down to 2.9KB per volume.
Topology Masking
To demonstrate the impact of utilizing ground truth sign/topology, we construct a baseline with a standard rate-distortion loss. Specifically, the distortion term is simplified to the squared error without the topology mask. This baseline is shown as no topology mask in Figure 5. Without the error bound, its distortion is much higher than other baselines. The second baseline, in addition to using the same distortion term, losslessly compresses and streams the signs during inference, as described in Section 4. Despite the increased rate due to losslessly compressed signs, this baseline still achieves a better rate-distortion trade-off. Finally, using topology masking in both training and inference yields the best rate-distortion performance.
Ablation studies
The impact of network architecture on compression is evaluated in Figure 6. While having more layers leads to better results, there are diminishing returns. To keep the model size practical, we restricted our model to three layers (MB). We also perform ablation for the blocksize (voxels/block). Since in all volumes, the voxel size is mm, a block with blocksize has the physical size of . Note that increasing the size of each block reduces the number of blocks. Results show that if one has a budget of more than KB per volume, using block size yields much better ratedistortion performance. Therefore in the following experiments, layers with blocks is used.
Stateoftheart comparisons
We compare with state-of-the-art geometry compression methods, including two volumetric methods, [compression] and JP3D [jpeg2000_jp3d]; two mesh compression methods, Draco [draco] and Free Viewpoint Video (FVV) [fvv]; as well as a point cloud compressor, MPEG V-PCC [schwarz2018emerging]. See their parameters in the supplementary material. For most of the methods, we sweep the rate hyperparameter to generate rate-distortion curves. The dataset [relightables] contains high-resolution meshes (K vertices), which has a negative impact on the Draco compression rate. Hence, for Draco only, we decimate the meshes to 25K vertices, termed Draco (decimated), to make it comparable to other methods. Figure 7 shows that on both datasets, our method significantly outperforms all prior art in both rate and distortion. For instance, to achieve the same level of rate (marked with in Figure 7 (b)), the distortion of our method () is of [compression] (), and of Draco (decimated) () and MPEG (). To achieve the same distortion level (), our method (KB) only requires of the previous best performing method [compression] (KB).
To showcase the difference in distortion, we select a few qualitative examples with similar rates, and visualize them in Figure 10: the Draco (decimated) results are low-resolution, the MPEG V-PCC results are noisy, while the results of [compression] suffer from blocking artifacts.
Efficiency
To assess the complexity of our neural network, we measure the runtime of the encoder and the decoder. We freeze our graph and run it using the TensorFlow C++ interface on a single NVIDIA PASCAL TITAN Xp GPU. Our range encoder implementation is single-threaded CPU code, hence we include only the neural network inference time. We measure 20 ms to run both encoder and decoder on all the blocks of a single volume.
6.2 Texture compression
We compare our texture parametrization to UVAtlas [zhou2004iso]. In order to showcase the benefit of Morton packing, we also include a block-based baseline where naïve bin packing is used without any spatiotemporal coherence, as shown in Table 2. To preserve the high quality of the target dataset [relightables], we generate high-resolution texture maps (4096x4096) for all experiments. The texture maps of each sequence are compressed with the H.264 implementation from FFmpeg with default parameters. Per-frame compressed sizes of different methods are reported to showcase how texture parametrization impacts the compression rate. In order to measure distortion, each textured volume with its decompressed texture atlas is rendered into the viewpoints of the RGB cameras that were used to construct the volumes, and compared with the corresponding raw RGB image. For simplicity we only select 10 views (out of 58) where the subject's face is visible. When computing distortion, masks are used to ensure only foreground pixels are considered, as shown in Figure 9.
Method  Rate  PSNR  SSIM  MS-SSIM
UVAtlas [zhou2004iso]  457  30.9  0.923  0.939 
Ours (Naïve)  529  30.9  0.924  0.939 
Ours (Morton)  350  30.9  0.924  0.940 
7 Conclusions
We have introduced a novel system for the compression of TSDFs and their associated textures achieving state-of-the-art results. For geometry, we use a block-based learned encoder-decoder architecture that is particularly well suited for the uniform 3D grids typically used to store TSDFs. To train better, we present a new distortion term to emphasize the loss near the surface. Moreover, ground truth signs of the TSDF are losslessly compressed with our learned model to provide an error bound during decompression. For texture, we propose a novel block-based texture parametrization algorithm which encourages spatiotemporal coherence without tracking and without the need for UV coordinate compression. As a result, our method yields a much better rate-distortion trade-off than prior art, achieving distortion, or when distortion is fixed, bitrate of [compression].
Future work
There are a number of interesting avenues for future work. In our architecture, we have assumed blocks to be i.i.d., and dropping this assumption could further increase the compression rate – for example, one could devise an encoder that is particularly well suited to compress “human shaped” geometry. Further, we do not make any use of temporal consistency in 4D sequences, while from the realm of video compression we know that coding inter-frame knowledge provides a very significant boost to compression performance. Finally, while our per-block texture parametrization is effective, it is not included in our end-to-end training pipeline – one could learn a per-block parametrization function to minimize screen-space artifacts.
References
8 Background on compression
Truncated Signed Distance Fields
A surface represented in TSDF implicit form is the zero crossing of a function that interpolates a uniform 3D grid of truncated (and signed) distances from the surface. By convention, distances outside and inside the surface get positive and negative signs respectively, and magnitudes are truncated by a threshold value. Typically a method like Marching Cubes [marchingcubes] is used to determine the topology of each voxel (i.e. which voxel edges intersect with the surface), as well as the offsets of the intersection points for the valid edges, which are then used to form a triangular mesh.
Lossless compression
The primary goal of general purpose lossless compression is to minimize the storage or transmission costs (typically measured in bits) of a discrete dataset $X$. Each data point $x$ of $X$ is mapped to a variable length string of bits for storage or transmission by the sender. A receiver then inverts the mapping to recover the original data from the transmitted bits. The Shannon entropy $H(p_X) = \mathbb{E}_{x \sim p_X}[-\log_2 p_X(x)]$ provides an achievable lower bound on the rate, i.e. the minimum expected number of bits required to encode an element, where $p_X$ is the underlying distribution of $X$. This is achievable by encoding $x$ to a bit string of length $-\log_2 p_X(x)$ bits. Although this length is not necessarily an integer, it can be achieved arbitrarily closely on average by an arithmetic coder [cover2006]. With this encoding, the number of bits needed to code the entire dataset is
$$R = -\sum_{x \in X} \log_2 p_X(x), \tag{4}$$
where $R$ is referred to as the bit rate of the compression.
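As a quick numerical check of this bound, the sketch below computes the empirical entropy of a toy dataset and the total ideal code length under a given probability model. The toy data and the function names are illustrative assumptions.

```python
import math
from collections import Counter

def empirical_entropy_bits(data):
    """Shannon entropy (bits per symbol) of the empirical distribution of `data`."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def ideal_code_length_bits(data, prob):
    """Total bits when each symbol x costs -log2 prob[x] bits."""
    return sum(-math.log2(prob[x]) for x in data)

data = "aaaabbcd"
model = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # matches the empirical frequencies
entropy = empirical_entropy_bits(data)
total_bits = ideal_code_length_bits(data, model)
mismatched = ideal_code_length_bits(data, {s: 0.25 for s in "abcd"})
```

When the model matches the data distribution, the total cost equals the dataset size times the entropy; any mismatched model, such as the uniform one above, costs more bits.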
Lossy compression
In contrast, lossy compression methods can achieve significantly higher compression rates by allowing errors in the received data. These errors are typically referred to as distortion $D$. In lossy compression there is a fundamental compromise between the distortion $D$ and the bit rate $R$, referred to as the rate-distortion trade-off, where distortion can be decreased by spending more bits. Minimizing $D$ subject to a constraint on $R$ leads to the following unconstrained optimization problem [ChouLG:89a, OrtegaR:98]
$$\min_{\hat{x}} \; D(x, \hat{x}) + \lambda R(\hat{x}), \tag{5}$$
where $\hat{x}$ is a discrete lossy representation of $x$ and $\lambda$ is a trade-off parameter. Higher values of $\lambda$ result in lower bit rates at the expense of increased distortion.
Lossy transform coding
Often is high dimensional, making the direct optimization of the problem above intractable. As a result, lossy transform coding is more commonly used instead. In lossy transform coding, a transformation is used to transform the original data into a latent representation and another is used to approximately recover the original data from the lossy latent representation . The transformations and , with parameters and , respectively, are typically chosen to simplify the conversion from to its lossy discrete version – a process called quantization. While and can be invertible transformations (e.g. the discrete cosine transform used for JPEG compression), in general they are not required to be. Thus, with , the original ratedistortion problem can be rewritten as
(6) 
where , , and the bit rate is , with as a probability model of with parameters that is learned jointly with . The code is converted to the corresponding variable length bit representation by entropy coding using the learned prior distribution .
Method  Rate Parameters (varied)  Fixed Parameters
Ours  (for )
[compression]
Google Draco [draco]
MPEG V-PCC [schwarz2018emerging]  r configurations (for )
Quantization
Since the quantization operation is non-differentiable, training such a network in an end-to-end fashion is challenging. [balle2017end] propose simulating quantization noise during training rather than explicitly discretizing the code. Specifically, they quantize by rounding to the nearest integer, which they model by adding uniform noise during training to simulate quantization errors; see [balle2017end] for additional details.
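The relation between rounding and its noisy stand-in can be sketched without any learning framework. The example below assumes unit-width uniform noise on [-0.5, 0.5], which matches the error range of rounding to the nearest integer; the function names are hypothetical.

```python
import random

def quantize(z):
    """Test-time quantization: round each latent to the nearest integer."""
    return [float(round(v)) for v in z]

def noisy_proxy(z, rng):
    """Training-time stand-in: add i.i.d. uniform noise on [-0.5, 0.5]."""
    return [v + rng.uniform(-0.5, 0.5) for v in z]

rng = random.Random(0)
z = [0.2, 1.7, -2.4]
z_hat = quantize(z)            # used by the entropy coder at test time
z_tilde = noisy_proxy(z, rng)  # differentiable with respect to z during training
```

Both versions perturb each latent by at most half a unit, but only the additive-noise version has a well-defined gradient with respect to the latent, which is what makes end-to-end training possible.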
9 Network architecture and training
We visualize the architecture of our model in Figure 11, which is formed by a three-layer encoder and decoder. While the architecture is similar to a convolutional autoencoder (implemented with convolutions in the encoder and transposed convolutions in the decoder), the main difference lies in the transformation the latent code goes through, and the additional losses that aim to minimize the bit rate as well as the reconstruction error, as visualized in Figure 12. Specifically, we add uniform noise to the code during training to simulate quantization. At test time we quantize the code and compress it with an entropy coder. Additionally, the decoder has two final convolutional heads that separate the estimation of signs and the TSDF values. The one- and two-layer models we experiment with are similar, with fewer layers.
Figure 12 provides an overview of our training setup with the dependencies for the three terms in our training loss. Unlike a regular autoencoder, which only aims to minimize the reconstruction error, we employ two additional losses to minimize the bit rates of the compressed signals for the latent code and the ground truth signs. Additionally, instead of equally weighting each element of the reconstructed block, we use the ground truth signs to mask the voxels that have no neighboring voxels with opposing signs and are therefore less significant.
10 Baseline parameters
The parameters used in our experiments (Figure 6) are described in Table 3, except for JP3D [jpeg2000_jp3d] and FVV [fvv], which we obtained from [compression]. To generate a curve, we varied the corresponding rate parameter during inference, whilst keeping other parameters fixed as shown. Notations and definitions of parameters can be found in the respective citations.