
CARNet: Compression Artifact Reduction for Point Cloud Attribute

09/17/2022
by Dandan Ding, et al.

A learning-based adaptive loop filter is developed for the Geometry-based Point Cloud Compression (G-PCC) standard to reduce attribute compression artifacts. The proposed method first generates multiple Most-Probable Sample Offsets (MPSOs) as potential approximations of the compression distortion, and then linearly weights them for artifact mitigation. As such, we drive the filtered reconstruction as close to the uncompressed PCA as possible. To this end, we devise a Compression Artifact Reduction Network (CARNet) which consists of two consecutive processing phases: MPSOs derivation and MPSOs combination. The MPSOs derivation uses a two-stream network to model local neighborhood variations from direct spatial embedding and frequency-dependent embedding, where sparse convolutions are utilized to best aggregate information from sparsely and irregularly distributed points. The MPSOs combination is guided by the least square error metric to derive weighting coefficients on the fly, further capturing the content dynamics of input PCAs. The CARNet is implemented as an in-loop filtering tool of G-PCC, where the linear weighting coefficients are encapsulated into the bitstream with negligible bit rate overhead. Experimental results demonstrate significant improvement over the latest G-PCC both subjectively and objectively.



I Introduction

A point cloud is a collection of a massive number of points that are sparsely and irregularly distributed in 3D space. It can flexibly and realistically represent a variety of 3D objects and scenes in many applications, such as Augmented Reality/Virtual Reality (AR/VR) and autonomous driving [17, 31]. Every valid point in a point cloud has a geometric coordinate, e.g., $(x, y, z)$ in a Cartesian coordinate system, and associated attribute components, such as RGB color, normal, and reflectance. This work primarily deals with the color attribute. Unlike 2D images, where pixels are uniformly distributed and well structured so that spatial correlations can be characterized easily, points in a 3D point cloud are unstructured, which makes it practically difficult to learn local neighborhood correlations for compact representation. This, in turn, hinders networked point cloud applications.

Abbreviation Description
BD-Rate Bjøntegaard Delta Rate [1]
CAR Compression Artifact Reduction
G-PCC Geometry-based PCC
LSE Least Square Error
MPEG Moving Picture Expert Group
MPSO Most-Probable Sample Offset
PCA Point Cloud Attribute
PCAC Point Cloud Attribute Compression
PCC Point Cloud Compression
TABLE I: Notations

To this end, since 2017, the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO) has been intensively investigating and promoting potential technologies for high-efficiency point cloud compression (PCC), leading to the conclusion of two PCC specifications, namely Geometry-based PCC (G-PCC) and Video-based PCC (V-PCC) [4, 3, 14]. In V-PCC, a 3D point cloud sampled at a specific time instance is first projected onto a set of perpendicular 2D planes; then a sequence of 2D planes consecutively spanning a period of time is compressed using standard-compliant video codecs like the prevalent HEVC (High-Efficiency Video Coding) [35] or the latest VVC (Versatile Video Coding) [2]. On the other hand, G-PCC directly compresses the 3D point cloud by separating its geometry and attribute components: the well-known octree model is often used for representing geometry coordinates, and the Region-Adaptive Hierarchical Transform (RAHT) [7] and the Hierarchical Prediction as Lifting Transform (PredLift) are selectable options to compress the attribute information lossily [14].

I-A Background and Motivation

For lossy compression using either V-PCC or G-PCC, compression artifacts like quantization-induced blockiness and blurriness are inevitable and perceptually annoying. Since the projection-based V-PCC adopts HEVC or VVC as the compression backbone, the rules-based in-loop filters [21] adopted in video codecs, including deblocking, sample adaptive offset (SAO), and adaptive loop filter (ALF), already significantly mitigate compression artifacts. Besides, recently-emerged learning-based filters can be augmented on top of existing codecs as either a post-processing module [26, 16] or an in-loop function [9, 27] to further improve the quality of restored pixels (e.g., YUV or RGB). This work thus focuses on compression artifact reduction (CAR) for G-PCC coded point cloud attributes (PCAs). Similar to existing works [43, 33, 31], we assume the geometry information has been losslessly reconstructed when dealing with lossy attribute compression.

Unfortunately, rules-based and learning-based filters alike have seldom been applied to reduce the attribute artifacts of G-PCC compressed point clouds. This is mainly because it is hard to effectively characterize and model attribute relations across sparsely and irregularly distributed points in a local neighborhood. As a naive comparison shown in Fig. 1, in an image, the $k \times k$ neighbors centered at a specific pixel can be used to model local neighborhood variations to develop a 2D ALF for artifact removal (we assume a square window for neighborhood modeling; other window shapes like diamond can be used as well [21, 37]); similarly, for a 3D point cloud, the $k \times k \times k$ neighbors centered at a specific point (or positively-occupied voxel; since raw point clouds are voxelized for G-PCC compression, the terms "point" and "positively-occupied voxel" are used interchangeably in this work) can potentially be utilized to develop a proper 3D filter. On one hand, as $k$ enlarges, the complexity required for computing and caching $k^3$ elements increases much faster than that of processing $k^2$ elements, even for a simple convolutional operation. More importantly, due to the sparse, unorganized, and irregular distribution of points, the occupancy status of neighboring voxels in a $k \times k \times k$ cube is nondeterministic and dynamic (see Fig. 1), which further complicates the characterization of attribute variations from spatial neighbors. Although an extremely large-scale network might be capable of learning such spatial variations, its complexity is unbearable for practical applications [33].

This work, therefore, develops an efficient attribute artifact reduction filter to overcome the aforementioned difficulties and improve the quality of compressed PCAs at low complexity.

Fig. 1: Characterization of Spatial Variations Across Local Neighbors. (a) Neighbors in a square window of a pixel in a 2D image. (b) Potential neighbors in a cube of a positively-occupied voxel (or point) in a 3D point cloud. Grey cubes stand for positively-occupied voxels which are converted from the raw points by voxelizing the input point cloud [31]. Whether a voxel is occupied is nondeterministic and highly content-dependent because of the dynamic, sparse, and unstructured distribution of points in a point cloud.

I-B Approach

Thanks to the superior representation capability of deep learning technologies, we resort to Deep Neural Networks (DNNs) for the implementation of a learned 3D in-loop filter to restore the quality of compressed PCAs. It is referred to as the CARNet (Compression Artifact Reduction Network).

In principle, spatial coherency shall be maintained across neighboring pixels in a 2D image, or across the attributes of neighboring points in a 3D point cloud, even after compression, with which a visually pleasant appearance is ensured for content rendering [37]. In this regard, we propose to leverage spatial characteristics learned from a local neighborhood to develop the CARNet. Similar to in-loop filtering studies for compressed 2D images/videos [21, 37, 12], our CARNet compensates the attribute distortion by estimating additive sample offsets, targeting to approach the original PCA input $A$ as closely as possible after the filtering process, i.e.,

$$\min_{\theta} \big\| A - \big( \tilde{A} + f_{\theta}(\tilde{A}) \big) \big\|_2^2, \qquad (1)$$

where $\tilde{A}$ is the compressed PCA and $f_{\theta}(\cdot)$ stands for the proposed CARNet adapted by a set of parameters $\theta$. As seen, $A - \tilde{A}$ is the compression-induced PCA distortion.
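
For illustration, the objective in Eq. (1) can be sketched in PyTorch as follows; f_theta is a placeholder handle for the offset-predicting network detailed in the remainder of this paper, and the tensor shapes are assumptions of this sketch.

```python
import torch

def filtering_objective(f_theta, A_cmp, A_org):
    """Squared-error objective of Eq. (1) for an (N, 3) attribute tensor.

    f_theta : network predicting an additive sample offset per point (placeholder)
    A_cmp   : compressed (distorted) point cloud attributes, shape (N, 3)
    A_org   : original point cloud attributes, shape (N, 3)
    """
    offset = f_theta(A_cmp)                # estimated compression distortion
    A_rec = A_cmp + offset                 # filtered reconstruction
    return torch.mean((A_org - A_rec) ** 2)
```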

The derivation of $f_{\theta}(\tilde{A})$ proceeds as follows:

  • For each point (or positively-occupied voxel) $p$, we define the 3D local neighborhood centered at it as $\mathcal{N}(p)$. Since point clouds are voxelized for G-PCC compression, we use a cubic window to identify the range of $\mathcal{N}(p)$ in this work for simplicity. Only the positively-occupied voxels inside $\mathcal{N}(p)$ are used to learn local characteristics for $p$. To this end, we utilize sparse convolutions and stack them to build sparse DNNs that aggregate valid local neighbors in $\mathcal{N}(p)$. To simplify the implementation, the range of $\mathcal{N}(p)$ can be set the same as the receptive field of the underlying sparse convolutions.

  • To thoroughly model the spatial coherency in a local neighborhood, we devise a two-stream network in the CARNet, where one stream applies sparse DNNs to directly characterize local variations spatially, and the other stream first separates high- and low-frequency components to embed frequency-dependent neighborhood relations and then concatenates them together for subsequent feature fusion.

  • To best estimate the distortion of a compressed PCA, we propose to generate multiple most-probable sample offsets (MPSOs) by aggregating the two-stream features. These MPSOs are then linearly combined through Least Square Error (LSE) optimization, as in Eq. (1), to best approach the compression distortion. Note that the linear weighting coefficients are derived on the fly and encapsulated into the bitstream with negligible bit rate consumption for each frame of input point clouds.

I-C Contribution

The main contributions of this paper are summarized below:

  • An efficient and low-complexity CARNet is developed to reduce the artifacts of G-PCC compressed point cloud attributes. Extensive results demonstrate that the CARNet brings 21.96% BD-Rate reduction over the latest G-PCC anchor across various common test point clouds recommended by the MPEG standardization committee. Relative to the state-of-the-art MS-GAT [33], a post-processing solution for artifact removal, our CARNet provides 12.95% BD-Rate gains and costs much less runtime (about a 30× speedup).

  • The efficiency of the CARNet comes from the effective characterization of neighborhood variations for the derivation of the most-probable sample offsets, which are then linearly weighted to produce the best additive offset for compression distortion compensation. The linear weighting coefficients are calculated on the fly under the guidance of the original PCA. As such, we can best capture the dynamics of the underlying content for better model generalization.

  • The lightweight computation of the CARNet is owing to the use of sparse convolutions to aggregate positively-occupied neighbors only within the convolutional receptive field. This not only best leverages the sparseness of valid points in a point cloud but also significantly increases the computational efficiency and reduces the time complexity, as has also been extensively studied for point cloud geometry processing [40, 39, 36].

II Related Work

This section reviews relevant studies on point cloud attribute compression and compression artifact reduction.

II-A Point Cloud Attribute Compression

The transform coding framework has been widely used for the compression of point cloud attributes [13, 15]. For example, Zhang et al. applied the Graph Transform (GT) [43] to compress color attributes on small graphs constructed from nearby points. However, the GT is computationally expensive because it requires eigenvalue decomposition. Later, De Queiroz et al. [8] proposed Gaussian Process Transforms (GPTs), which are equivalent to Karhunen-Loève Transforms (KLTs) of the underlying Gaussian process. Around the same time, the Region-Adaptive Hierarchical Transform (RAHT) [7], which is basically an adaptive Haar wavelet transform, was developed. Note that RAHT was adopted into the MPEG G-PCC as the main tool for lossy attribute compression. Recently, the prediction of RAHT coefficients [34] was studied and included in the latest G-PCC reference software TMC13v14 with state-of-the-art compression efficiency reported. Besides, the hierarchical neighborhood Prediction as Lifting transform, termed PredLift, is also offered in G-PCC as the other lossy attribute coding configuration.

Built upon recent advances in deep learning techniques, DNNs can also be used to facilitate transform coding. Instead of applying handcrafted rules, data-driven learning is applied to directly derive (non-linear) transforms and context models for PCA compression. Among them, end-to-end supervised learning is the most straightforward solution [30]. Sheng et al. [32] designed a point-based lossy attribute autoencoder, where stacked multi-layer perceptrons (MLPs) were used to extract spatial correlations across points and transform the input attributes into high-dimensional features for entropy coding. He et al. [19] introduced a density-preserving deep point cloud compression framework, which could be extended to support attribute compression. Wang et al. [41] proposed an end-to-end sparse convolution-based PCA compression method, called SparsePCAC, for high-efficiency feature extraction and aggregation. Although these end-to-end learning-based solutions demonstrate encouraging potential, their compression performance is still inferior to the latest G-PCC reference model TMC13v14.

In addition to the end-to-end approach, Fang et al. [11] used an MLP-based model to replace the traditional entropy coding tool in G-PCC. They encoded RAHT coefficients using a neural model where side information including tree depth, weight, location, etc., was leveraged to estimate the probability of each coefficient for arithmetic coding.

The above rules-based and learning-based solutions all involve quantization for attribute compression. This inevitably introduces annoying compression artifacts and leads to unsatisfactory perceptual quality, imposing an urgent need for artifact reduction and quality improvement of compressed PCAs.

II-B Compression Artifact Reduction Methods

Compression artifact reduction was extensively studied for restoring better reconstruction quality of compressed 2D images/videos, including both rules-based and learning-based methods, for either in-loop filtering or post-processing [28, 12, 37, 21, 27, 10, 26, 42, 24, 29, 20, 45, 38, 23]. Recently, learning-based solutions presented outstanding performance with remarkable quality improvement, which is mainly due to the use of powerful DNNs that effectively model spatial or spatiotemporal neighborhood characteristics for quality enhancement and artifact removal [9, 16].

Despite abundant learning-based solutions for the quality enhancement of compressed 2D images/videos, we cannot straightforwardly extend them to 3D PCAs. Unlike the well-structured pixels in 2D images/videos, points in a point cloud are sparsely and irregularly distributed. As a result, it is much more difficult to exploit local neighborhood variations in 3D point clouds than in 2D images/videos. To tackle this problem, the MS-GAT [33], a pioneering and probably the only published exploration of artifact removal for G-PCC compressed PCAs, adopted graph convolution and graph attention layers instead of trivial MLPs for attribute correlation exploration. As reported in MS-GAT, a 1.98 MB post-processing network achieved 10.76%, 6.14%, and 8.83% BD-Rate gains over the G-PCC for the Y, U, and V components, respectively, at the expense of extremely high computational complexity. Even when slicing the input point cloud into block patches of only 2048 points each, its running speed is about 380× slower than the G-PCC anchor, according to the complexity report of MS-GAT (because the G-PCC and MS-GAT adopt different implementation platforms and techniques, the runtime results only serve as an intuitive reference for the computational complexity). In this regard, an efficient-yet-lightweight compression artifact reduction approach is highly desired.

Fig. 2: The proposed Compression Artifact Reduction Network (CARNet). The CARNet consists of two processing phases: the MPSOs Derivation and the MPSOs Combination. The term "MPSOs" stands for Most-Probable Sample Offsets. The first phase uses a two-stream network to derive multiple MPSOs as potential approximations of attribute compression distortion; subsequently, the second phase leverages least square error optimization to linearly combine the MPSOs for attribute artifact compensation and quality improvement. The two-stream network for the MPSOs derivation has two branches: one for Frequency-Dependent Embedding and the other for Direct Spatial Embedding, and the MPSOs are produced by fusing features from these two branches. "C", "+", and "-" indicate concatenation, summation, and subtraction operations, respectively. SConv denotes a sparse convolution with a given kernel size and number of output channels. Simple ReLU is used for activation. The IRB (Inception Residual Block) stacks sparse convolutions with different depths for multi-level feature aggregation.

III The Proposed CARNet for Artifact Reduction of Compressed Point Cloud Attributes

Without loss of generality, a given point cloud is represented using a sparse tensor $\{C, A\}$, where $C = (x, y, z)$ stands for the collection of geometry coordinates and $A = (Y, U, V)$ represents the associated color attributes. Following the convention used in existing works [43, 14], point cloud geometry coordinates are losslessly compressed as prior knowledge to construct the local neighborhood for the lossy compression of color attributes.

III-A Framework of the CARNet

As aforementioned, we propose the CARNet to solve Eq. (1). The CARNet is comprised of two consecutive phases: MPSOs Derivation and MPSOs Combination, as illustrated in Fig. 2. The MPSOs Derivation phase uses a two-stream network to derive a group of Most-Probable Sample Offsets as potential approximations of attribute compression distortion; and the MPSOs Combination phase linearly weights MPSOs to produce the best additive sample offset for distortion compensation of compressed PCAs.

Note that the derivation of MPSOs fully relies on the sparse DNNs to characterize and embed spatial variations in a local neighborhood under the spatial coherent assumption on attributes. The linear weighting coefficients of MPSOs are computed on-the-fly, further generalizing the CARNet to instantaneous PCA input.

First, the compressed three-channel PCA in YUV format, i.e., $\tilde{A} \in \mathbb{R}^{N \times 3}$, is projected to a high-dimensional feature space $F$ for subsequent processing using a sparse convolution "SConv" (its kernel size and channel number are given in Fig. 2). $N$ is the total number of valid points (or positively-occupied voxels) in this compressed PCA.
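
For illustration, this projection step can be sketched with MinkowskiEngine sparse tensors (the library used in our implementation, see Section IV-A); the 64-channel width and the 3-tap kernel here are assumptions of this sketch rather than the exact configuration.

```python
import torch
import MinkowskiEngine as ME

# Toy occupied voxels on an 8-bit grid: (N, 4) coordinates [batch, x, y, z] and (N, 3) YUV features.
xyz = torch.stack(torch.meshgrid(torch.arange(10), torch.arange(10), torch.arange(10),
                                 indexing="ij"), dim=-1).reshape(-1, 3)
coords = torch.cat([torch.zeros(len(xyz), 1, dtype=torch.long), xyz], dim=1).int()
yuv = torch.rand(len(xyz), 3)             # compressed Y, U, V per occupied voxel

x = ME.SparseTensor(features=yuv, coordinates=coords)

# "SConv": a sparse 3D convolution lifting the 3-channel PCA into a high-dimensional
# feature space defined only on the occupied voxels.
sconv = ME.MinkowskiConvolution(in_channels=3, out_channels=64, kernel_size=3, dimension=3)
F = sconv(x)                              # F.F has shape (N, 64); coordinates are unchanged
```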

Given the significant difference between the Y and U/V components, this work processes Y, U, and V separately (the separate processing of Y, U, and V is also applied in MS-GAT [33]). When processing the U component, it is concatenated with the compressed Y component for filtering; similarly, both the compressed Y and U components are concatenated with V for its enhancement. In this way, we leverage cross-component processing for better performance [22]. As seen in Fig. 5, the same CARNet architecture is used, but the model parameters are adapted through training for the different color components. In the following, we detail the CARNet for the Y component first and then extend it to the processing of the U/V components.

III-B MPSOs Derivation

Fig. 3: Inception Residual Block (IRB) used for efficient multi-level feature aggregation. Sparse convolutions (SConv) are stacked with various kernel sizes and channel numbers. Simple ReLU is used for activation.

A two-stream network is developed that feeds the features $F$ separately into two branches to generate a group of MPSOs approximating the compression distortion.

Direct Spatial Embedding (DSE). One branch of the proposed two-stream network directly aggregates useful information from spatial neighbors in close proximity to model spatial variations. This generally assumes that the color attributes of nearby spatial neighbors are more or less similar to each other. To best characterize and embed the spatial correlations of PCAs, we cascade three identical Direct Spatial Embedding units to generate high-dimensional feature tensors. Each unit is comprised of a sparse convolutional layer, an activation layer using simple ReLU, and an Inception Residual Block (IRB). The IRB module is borrowed from our previous work [40] for its efficiency in aggregating information from multi-level representations. A typical example of the IRB is shown in Fig. 3. As seen, we stack sparse convolutional layers with different depth hierarchies, kernel sizes, and channel numbers for the aforementioned multi-level information embedding. Moreover, both the IRB and the Direct Spatial Embedding branch are augmented with residual links [18] for fast and robust aggregation of the output features $F_D$.
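
For illustration, one Direct Spatial Embedding unit and a simplified IRB can be sketched with MinkowskiEngine as below; the branch layout and channel split inside the IRB follow the spirit of [40] but are assumptions of this sketch, not the exact CARNet layers.

```python
import torch.nn as nn
import MinkowskiEngine as ME

def sconv(cin, cout, k):
    return ME.MinkowskiConvolution(cin, cout, kernel_size=k, dimension=3)

class IRB(nn.Module):
    """Simplified Inception Residual Block: parallel sparse-conv paths of different
    depths and kernel sizes, concatenated and added back to the input (residual link)."""
    def __init__(self, c):
        super().__init__()
        self.path1 = nn.Sequential(sconv(c, c // 2, 1))
        self.path2 = nn.Sequential(sconv(c, c // 4, 3), ME.MinkowskiReLU(),
                                   sconv(c // 4, c // 4, 3))
        self.path3 = nn.Sequential(sconv(c, c // 4, 1), ME.MinkowskiReLU(),
                                   sconv(c // 4, c // 4, 3), ME.MinkowskiReLU(),
                                   sconv(c // 4, c // 4, 1))
        self.relu = ME.MinkowskiReLU()

    def forward(self, x):
        y = ME.cat(self.path1(x), self.path2(x), self.path3(x))  # back to c channels
        return self.relu(y) + x                                  # residual aggregation

class DSEUnit(nn.Module):
    """One Direct Spatial Embedding unit: SConv -> ReLU -> IRB."""
    def __init__(self, c):
        super().__init__()
        self.conv = sconv(c, c, 3)
        self.relu = ME.MinkowskiReLU()
        self.irb = IRB(c)

    def forward(self, x):
        return self.irb(self.relu(self.conv(x)))
```

Three such units would be cascaded, with an outer residual link, to produce the DSE output $F_D$.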

Frequency-Dependent Embedding (FDE). In addition to directly analyzing and embedding spatial characteristics, we separate the input to perform frequency-dependent embedding. This is because compression-induced distortion usually presents different levels of artifacts across frequency bands. For example, visually-annoying artifacts often reside in regions with high-frequency components of the input signal, such as object edges and contours, since low-frequency components can be well represented by transforms and/or predictions, while high-frequency components easily suffer from imperfect prediction and degradation in quantization. In this regard, we propose separate Low-Frequency Embedding and High-Frequency Embedding subbranches.

Low-Frequency Embedding (LFE). We apply a classical autoencoder, shown in Fig. 4, to aggregate low-frequency information spatially. At the encoder, the input features $F$ first go through three Resolution Downsamplers for low-frequency information aggregation. Each Resolution Downsampler consists of a convolutional layer, a convolutional downsampling layer, a ReLU-based activation layer, and an IRB layer. For the convolutional downsampling layer, a sparse convolution with a stride of 2 is applied. As a result, at the bottleneck, the input is squeezed by a factor of 2 in each dimension at every downsampler, i.e., to 1/8 of its original resolution per dimension, with 64 channels as exemplified. The decoder then symmetrically applies Resolution Upsamplers to upscale the features using transposed sparse convolutions "TSConv", generating the output features $F_L$.

Fig. 4: Low-Frequency Embedding using an autoencoder. The features are progressively downsampled by three Resolution Downsamplers at the encoder for information aggregation and correspondingly upsampled by three Resolution Upsamplers at the decoder to recover the resolution. Each Resolution Downsampler/Upsampler has four layers. The encoder and decoder are symmetric: the downsampling layer uses SConv with a stride of 2 and the upsampling layer uses TSConv with the same stride; the other layers perform no re-sampling. The channel number is set to 64 at the bottleneck, assuming that low-frequency components are sufficiently embedded using such a setting.
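
For illustration, one Resolution Downsampler/Upsampler pair can be sketched as follows, reusing the IRB class from the DSE sketch above; the channel widths are assumptions, and in the full model the upsampled output would be mapped back onto the original occupied coordinates.

```python
import torch.nn as nn
import MinkowskiEngine as ME

class ResolutionDownsampler(nn.Module):
    """SConv -> stride-2 SConv (resolution halving) -> ReLU -> IRB."""
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = ME.MinkowskiConvolution(cin, cout, kernel_size=3, dimension=3)
        self.down = ME.MinkowskiConvolution(cout, cout, kernel_size=2, stride=2, dimension=3)
        self.relu = ME.MinkowskiReLU()
        self.irb = IRB(cout)   # IRB as in the DSE sketch

    def forward(self, x):
        return self.irb(self.relu(self.down(self.conv(x))))

class ResolutionUpsampler(nn.Module):
    """Mirror of the downsampler: a stride-2 transposed sparse convolution (TSConv)
    recovers the resolution, followed by SConv -> ReLU -> IRB."""
    def __init__(self, cin, cout):
        super().__init__()
        self.up = ME.MinkowskiConvolutionTranspose(cin, cout, kernel_size=2, stride=2, dimension=3)
        self.conv = ME.MinkowskiConvolution(cout, cout, kernel_size=3, dimension=3)
        self.relu = ME.MinkowskiReLU()
        self.irb = IRB(cout)

    def forward(self, x):
        return self.irb(self.relu(self.conv(self.up(x))))
```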

High-Frequency Embedding (HFE). The preceding Low-Frequency Embedding leverages sparse convolution-based re-sampling to capture low-frequency information. To further characterize spatial variations from high-frequency features, a High-Frequency Embedding is devised in parallel to the Low-Frequency Embedding, as illustrated in Fig. 2. For simplicity, average pooling and upsampling operations are consecutively conducted upon the input $F$ to generate a low-pass approximation, which is then subtracted from $F$ to derive the high-frequency information $F_H$. The average pooling operation is realized by a strided sparse convolution SConv, i.e., $F_{\downarrow} = \mathrm{AvgPool}(F)$, and the upsampling operation applies the transposed sparse convolution TSConv with the same stride, i.e., $F_{\uparrow} = \mathrm{Up}(F_{\downarrow})$. It then leads to

$$F_H = F - \mathrm{Up}\big(\mathrm{AvgPool}(F)\big). \qquad (2)$$

In this work, the stride and kernel size of this pooling/upsampling pair are fixed hyperparameters shared by both operations.
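
For illustration, Eq. (2) can be sketched on a dense voxel grid with standard 3D pooling; the actual CARNet realizes the pooling/upsampling pair with strided sparse convolutions on occupied voxels only, and the stride of 2 here is an assumption of this sketch.

```python
import torch
import torch.nn.functional as nnf

def high_frequency(feat, stride=2):
    """Eq. (2) on a dense grid: F_H = F - Up(AvgPool(F)).

    feat: (B, C, D, H, W) feature volume, a dense stand-in for the sparse tensor F.
    """
    low = nnf.avg_pool3d(feat, kernel_size=stride, stride=stride)   # low-pass approximation
    up = nnf.interpolate(low, scale_factor=stride, mode="nearest")  # back to the input resolution
    return feat - up                                                # high-frequency residual F_H

F_H = high_frequency(torch.rand(1, 64, 16, 16, 16))                 # same shape as the input
```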

We subsequently concatenate the outputs of the Low-Frequency Embedding and High-Frequency Embedding subbranches, i.e., $F_L$ and $F_H$, and fuse them using a Direct Spatial Embedding unit to produce the frequency-dependent feature tensor $F_F$.

Feature Fusion for MPSOs. At the end of the two-stream network, $F_D$ and $F_F$ generated by the two branches are concatenated and processed using another residual Direct Spatial Embedding unit to produce a set of MPSOs as distortion approximations of the compressed PCA.

III-C MPSOs Combination

In the MPSOs Combination phase, we linearly weight the derived MPSOs to estimate the final sample offsets for distortion compensation. As seen in Fig. 2, after fusing the latent features from the two streams, a set of MPSOs, denoted $\{O_1, O_2, \ldots, O_K\}$, is generated. Figure 2 exemplifies three MPSOs, dubbed $O_1$, $O_2$, and $O_3$. The final estimated distortion $\hat{D}$ is derived by combining these MPSOs linearly using coefficients $\{w_k\}$ as follows:

$$\hat{D} = \sum_{k=1}^{K} w_k \, O_k. \qquad (3)$$

Then, the problem is how to effectively approximate these linear weighting coefficients for the squared-error minimization defined in (1).

Let $A$ and $\tilde{A}$ be the original and compressed point cloud attributes, respectively. The attribute compression distortion is

$$D = A - \tilde{A}. \qquad (4)$$

By combining (3) and (4), the problem in (1) is rewritten as

$$\min_{\{w_k\}} \Big\| D - \sum_{k=1}^{K} w_k \, O_k \Big\|_2^2. \qquad (5)$$

We can then approximate $\{w_k\}$ through LSE optimization, i.e.,

$$\mathbf{w} = \big(\mathbf{O}^{\top}\mathbf{O}\big)^{-1}\mathbf{O}^{\top}\mathbf{d}, \qquad (6)$$

where $\mathbf{O} = [\mathrm{vec}(O_1), \ldots, \mathrm{vec}(O_K)]$ stacks the vectorized MPSOs, $\mathbf{d} = \mathrm{vec}(D)$, and $D$ is available at the encoder.
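
For illustration, the encoder-side LSE of Eq. (6) can be sketched as below; torch.linalg.lstsq is used in place of the explicit normal equations for numerical stability.

```python
import torch

def lse_weights(mpsos, A_org, A_cmp):
    """Solve Eq. (6): w = argmin_w || (A_org - A_cmp) - sum_k w_k * O_k ||^2.

    mpsos : list of K tensors, each (N, C), the Most-Probable Sample Offsets
    A_org : original attributes (N, C), available at the encoder only
    A_cmp : compressed attributes (N, C)
    """
    O = torch.stack([o.reshape(-1) for o in mpsos], dim=1)   # (N*C, K) vectorized MPSOs
    d = (A_org - A_cmp).reshape(-1, 1)                       # (N*C, 1) distortion D of Eq. (4)
    return torch.linalg.lstsq(O, d).solution.squeeze(1)      # (K,) weighting coefficients

# The final offset of Eq. (3) is then sum_k w_k * O_k, added to A_cmp for the filtered output.
```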

These linear weighting coefficients are then explicitly signaled in the compressed bitstream. The proposed MPSOs Combination mechanism can be applied either at the frame level, over the whole point cloud frame, or at the block level, by slicing the point cloud frame into fixed-size block patches. This work adopts frame-level signaling as an example.

To encode $\{w_k\}$, we first scale and clip each coefficient to a predefined integer range and then encode it using fixed-length entropy codes. As will be shown subsequently, three weighting coefficients are used in our implementation because of their outstanding BD-Rate gains in our experiments. These three coefficients are each scaled by 128, bounded in the range of [-16, 15], and represented using 5-bit codes.

Using such a limited number of weighting coefficients has almost no impact on the bit rate of the compressed PCA but benefits the reconstruction quality significantly. For example, three 5-bit coefficients consume only 15 bits in total per frame; for a point cloud with 800,000 points, the bit rate cost of these three weighting coefficients is as small as 1.9E-5 bits per point (bpp), which is negligible relative to the bit rate of the G-PCC compressed attributes.
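
For illustration, the frame-level coefficient signaling described above (scaling by 128, clipping to [-16, 15], 5-bit fixed-length codes) and the resulting bit-rate overhead can be sketched as follows; the decoder-side reconstruction shown is simply the inverse scaling.

```python
def quantize_weights(weights, scale=128, lo=-16, hi=15):
    """Scale, round, and clip each coefficient to a 5-bit signed integer."""
    return [max(lo, min(hi, round(w * scale))) for w in weights]

def dequantize_weights(q_weights, scale=128):
    """Decoder side: recover the approximate weighting coefficients."""
    return [q / scale for q in q_weights]

q = quantize_weights([0.05, -0.02, 0.09])   # toy coefficient values -> [6, -3, 12]
bits = 5 * len(q)                           # 3 coefficients x 5 bits = 15 bits per frame
bpp_overhead = bits / 800_000               # ~1.9e-5 bpp for an 800,000-point frame
```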

III-D Chroma Filtering Using Cross-Component CARNet

The previous sections detail the processing of the Y component using the proposed CARNet. This section extends the CARNet to the U and V components. As verified in previous works [24, 44], there exist substantial correlations across the Y, U, and V components, the so-called cross-component correlations. Such correlations are leveraged in this work for performance improvement. We propose a cascading cross-component strategy which uses the compressed Y attribute, which contains rich texture details, to help process the U component, and both the Y and U components to help process the V component. The specific network structures for the U and V components are shown in Fig. 5.

Fig. 5: The cross-component strategy for U and V attribute modeling. We use the compressed Y attribute to assist the restoration of the U component and use Y and U together to assist the restoration of the V component.
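
For illustration, the assembly of the cross-component inputs before the initial SConv projection can be sketched as follows; the channel ordering is an assumption of this sketch.

```python
import torch

def build_input(component, y_cmp, u_cmp=None, v_cmp=None):
    """Per-point input channels for each color component's CARNet model.

    y_cmp, u_cmp, v_cmp: (N, 1) compressed attribute channels on the same occupied voxels.
    """
    if component == "Y":
        return y_cmp                                     # the luma model sees Y only
    if component == "U":
        return torch.cat([u_cmp, y_cmp], dim=1)          # U assisted by the compressed Y
    if component == "V":
        return torch.cat([v_cmp, y_cmp, u_cmp], dim=1)   # V assisted by both Y and U
    raise ValueError(component)
```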

IV Experimental Results

IV-A Experimental Settings

Fig. 6: R-D curves of the G-PCC and the proposed CARNet. The colorized ShapeNet dataset is used to train CARNet models. R-D curves measured in Y or YUV space are both provided.

Training and Testing Datasets. We use ShapeNet [5], which consists of 10,000 3D models, to build our training dataset. Following the method suggested by SparsePCAC [41], we quantize the geometry coordinates of the ShapeNet models to 8-bit integers and randomly use color images chosen from the COCO dataset [25] to paint the point clouds, generating their color attributes.

We compress these colorized point clouds using the latest G-PCC reference software TMC13v14 (https://github.com/MPEGGroup/mpeg-pcc-tmc13), where the color attributes are processed using the default RAHT at four Quantization Parameters (QPs), i.e., 34, 40, 46, and 51, to generate training samples. Currently, a separate model is trained for each QP value.
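
For illustration, the 8-bit geometry quantization of the ShapeNet shapes can be sketched as below; the colorization helper shown here (sampling pixels from a COCO image through a simple planar mapping) is only one plausible painting scheme and is an assumption of this sketch, not the exact procedure of [41].

```python
import numpy as np

def quantize_geometry_8bit(points):
    """Normalize raw coordinates to an 8-bit grid and keep the unique occupied voxels."""
    p = points - points.min(axis=0)
    p = p / p.max() * 255.0                                   # fit the longest edge into [0, 255]
    return np.unique(np.round(p).astype(np.int32), axis=0)    # (M, 3) occupied voxel coordinates

def paint_with_image(voxels, image):
    """Toy colorization: sample one RGB pixel per voxel from an (H, W, 3) image."""
    h, w, _ = image.shape
    rows = (voxels[:, 0] / 255.0 * (h - 1)).astype(np.int32)
    cols = (voxels[:, 1] / 255.0 * (w - 1)).astype(np.int32)
    return image[rows, cols]                                  # (M, 3) color attributes
```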

It is also worth pointing out that for fair comparisons with the MS-GAT [33], our CARNet is refined using the same training samples provided by the MS-GAT. For both the MS-GAT and the proposed CARNet, we assume the G-PCC as the compression backbone and keep the same G-PCC codec configuration for comparison. Following the conventions used in the G-PCC and MS-GAT, we process the point cloud attribute in YUV domain.

In the tests, we evaluate the proposed CARNet using 14 different point clouds widely used in standardization committees, including 5 from the Microsoft Voxelized Upper Bodies (MVUB, 9 bit), 5 from the 8i Voxelized Full Bodies (8iVFB, 10 bit), and 4 from the Owlii dynamic human mesh (Owlii, 11 bit). Notice that the training samples generated from colorized ShapeNet models are very different from the testing point clouds, revealing the generalization of the CARNet.

Training and Testing Settings.

Our project is implemented using PyTorch and MinkowskiEngine [6] on a computer with an Intel i7-8700K CPU, 32 GB memory, and an Nvidia TITAN RTX GPU. The network model is optimized by Adam, with parameters $\beta_1$ and $\beta_2$ set to 0.9 and 0.999, respectively. The learning rate decays from 1e-4 to 1e-5 every two epochs. We randomly initialize the model, and each model is trained for up to 20 epochs. It takes around 70 hours for a model to converge on our platform.

The test strictly follows the MPEG common test condition (CTC). We use bpp (bits per point) to measure how many bits are required per point and use PSNR (Peak Signal-to-Noise Ratio) to evaluate the restored quality of the Y, U, and V components individually and of YUV jointly. The quality computation of the individual color components and of the combined YUV follows the same procedure provided in the G-PCC reference software for fair comparisons. The bit consumption of the three linear weighting coefficients is also included when deriving the bpp. The Bjøntegaard Delta bit rate (BD-Rate) [1] is used to measure the average Rate-Distortion (R-D) performance across the four QP values.
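
For illustration, the bpp and per-component PSNR computations can be sketched as below; a peak value of 255 is assumed here, while the combined-YUV quality in our experiments follows the procedure of the G-PCC reference software.

```python
import numpy as np

def bpp(total_bits, num_points):
    """Bits per point, including the 15-bit weighting-coefficient overhead per frame."""
    return total_bits / num_points

def psnr(ref, rec, peak=255.0):
    """PSNR of one color component over all points; ref and rec are (N,) arrays."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# Per-component quality as reported in the tables, e.g.:
# psnr_y = psnr(Y_original, Y_filtered); psnr_u = psnr(U_original, U_filtered)
```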

IV-B Performance Evaluation

Point Cloud Y U V YUV
MVUB
9-bit
andrew -15.30% -41.90% -29.08% -18.36%
david -18.35% -34.25% -34.41% -23.38%
phil -21.52% -30.44% -28.45% -23.16%
ricardo -19.52% -34.60% -23.69% -22.32%
sarah -22.10% -33.22% -33.50% -25.54%
8iVFB
10-bit
longdress -15.94% -23.02% -27.68% -19.20%
loot -16.63% -23.58% -27.57% -20.15%
redandblack -13.74% -29.92% -20.05% -17.38%
queen -23.50% -36.35% -35.06% -31.61%
soldier -17.13% -20.55% -30.02% -19.25%
Owlii
11-bit
basketball_player -21.62% -18.83% -19.50% -21.19%
exercise -20.41% -25.30% -26.52% -21.82%
dancer -20.87% -22.56% -25.44% -21.74%
model -17.54% -32.54% -38.41% -22.27%
Average -18.87% -29.07% -28.53% -21.96%
TABLE II: BD-Rate gains of the proposed CARNet against the G-PCC (TMC13v14) measured in respective Y, U, V, and YUV spaces

IV-B1 Quantitative Measurement

Both the G-PCC and the MS-GAT are used as anchors to evaluate the proposed CARNet.

Compared with the G-PCC. We first compare the proposed CARNet to the G-PCC anchor using its latest reference TMC13v14. As shown in Table II, the proposed CARNet outperforms state-of-the-art G-PCC by 18.87%, 29.07%, 28.53%, and 21.96% BD-Rate gains in respective Y, U, V and YUV spaces. It is also observed that the CARNet consistently performs well for 9-bit, 10-bit, and 11-bit point clouds, although they have quite different sources and characteristics. The R-D curves are also visualized in Fig. 6.

Point Cloud | vs. G-PCC (TMC13v14): Y, U, V, YUV | vs. MS-GAT [33]: Y, U, V, YUV
MVUB
9-bit
andrew -14.10% -38.67% -22.72% -16.76% -6.53% -31.79% -9.19% -9.24%
david -18.36% -38.77% -28.82% -23.01% -9.47% -31.47% -18.88% -14.28%
phil -21.96% -27.51% -25.18% -22.85% -11.67% -22.05% -20.99% -13.58%
ricardo -18.58% -31.28% -22.50% -20.97% -10.61% -23.69% -12.25% -12.77%
sarah -22.32% -36.48% -27.61% -25.26% -11.35% -27.84% -16.09% -14.57%
8iVFB
10-bit
longdress -18.06% -22.29% -27.44% -20.42% -6.11% -20.39% -24.17% -11.53%
redandblack -15.96% -28.14% -19.43% -18.45% -6.14% -26.12% -15.41% -10.96%
soldier -20.21% -16.70% -28.71% -20.97% -8.90% -10.83% -19.02% -10.54%
Owlii
11-bit
dancer -23.14% -28.67% -29.76% -24.64% -13.05% -27.03% -25.42% -16.12%
model -19.47% -38.31% -40.71% -24.57% -9.44% -35.62% -34.48% -15.93%
Average -19.22% -30.68% -27.29% -21.79% -9.33% -25.68% -19.59% -12.95%
TABLE III: BD-Rate reduction of proposed CARNet to G-PCC and MS-GAT [33]. The CARNet is finetuned using the datasets suggested in MS-GAT.
Fig. 7: R-D curves of G-PCC, MS-GAT, and the proposed CARNet. The MS-GAT performance is obtained using its pre-trained models. Both Y and YUV R-D curves are provided to demonstrate the BD-Rate performance.

Compared with the MS-GAT [33]. We also compare our CARNet with the MS-GAT [33] - a post-processing method for G-PCC compressed attribute artifacts removal with state-of-the-art performance.

The MS-GAT selects five point clouds from the MPEG dataset for training (including "basketball_player", "loot", "exercise", "queen", and "boxer"). For fair comparisons, we build a training dataset using the same point clouds. Specifically, we partition each point cloud into patches of 50,000, 80,000, and 100,000 points using a K-Dimensional (KD) tree and finally obtain 648 training samples to refine the CARNet.

The performance of MS-GAT is obtained using its pretrained models. Although these MS-GAT models are trained using G-PCC reference model version 12, i.e., TMC13v12, we directly apply them on TMC13v14 compressed PCAs for testing because there is negligible performance difference between TMC13v12 and TMC13v14. We also test exactly the same point clouds used in MS-GAT.

Table III and Figure 7 compare the enhancement gains of the MS-GAT and the CARNet. It can be seen that refined by the MS-GAT’s dataset, the CARNet achieves 19.22%, 30.68%, and 27.29% BD-Rate gains over the G-PCC, and 9.33%, 25.68%, and 19.59% BD-Rate gains over the MS-GAT on the Y, U, and V component, respectively. For the measurement using compound YUV, the CARNet further offers superior performance to the MS-GAT, i.e., 12.95% BD-Rate improvement.

IV-B2 Qualitative Visualization

We further visualize reconstructed PCA samples processed by the G-PCC, the MS-GAT, and the proposed CARNet for comparison in Fig. 8. The three examples are compressed by the G-PCC using a QP value of 40. The reconstructed attributes from the G-PCC contain inevitable compression noise with noticeable blurriness and blockiness; for example, the face of "longdress" is severely distorted and exhibits obvious blockiness. There is also a serious color-cast issue in the G-PCC reconstructions, e.g., on the soldier's face.

Enhanced by the MS-GAT, the quality of the reconstructed attributes is improved to some extent, yielding a smoother and less noisy appearance. However, blurriness and blockiness are still present, such as in the leg area of "soldier", the dress texture of "longdress", and the face of "david". Taking "longdress" as an example, the lines and edges in the dress are blurry, and the red line color is visibly impaired. By comparing the visualization results of the G-PCC and the MS-GAT, we find that the MS-GAT inherits considerable compression distortion from the G-PCC, suggesting that these artifacts are not sufficiently removed and there is room for further improvement.

By contrast, our reconstructions demonstrate drastic visual improvement over previous methods. As seen, the faces of all three examples look more natural and smooth, and the color is closer to the ground truth; the edges and lines are well preserved, such as the areas on the soldier’s legs and the lady’s dress. Since the proposed CARNet optimizes the attribute based on the entire point cloud, the restored attribute uniformly looks more appealing and realistic.

Fig. 8: Qualitative visualization of reconstructions using G-PCC, MS-GAT, and proposed CARNet. The QP is 40 for G-PCC attribute compression. Three point clouds including soldier, longdress, and david are visualized from the top to bottom.
Point Cloud | G-PCC Dec. | MS-GAT: Part&Comb, Y, U, V | Proposed CARNet: Y (overall, MD, MC), U (overall, MD, MC), V (overall, MD, MC)
MVUB 9-bit 0.48 0.15 16.61 16.42 16.18 1.12 0.75 0.37 1.11 0.74 0.37 1.17 0.79 0.38
8iVFB 10-bit 1.46 1.68 52.31 50.61 48.50 1.66 1.23 0.43 1.73 1.29 0.44 1.60 1.32 0.41
Owlii 11-bit 3.60 15.38 139.85 137.16 138.03 2.80 2.40 0.41 2.92 2.50 0.42 2.99 2.58 0.42
Average 1.40 3.66 51.97 50.82 50.25 1.63 1.22 0.39 1.68 1.26 0.40 1.71 1.30 0.40
TABLE IV: Runtime (in seconds) analysis and comparison. Both the MS-GAT [33] and the CARNet are run on the same GPU platform for runtime collection. Part&Comb stands for block patch “Partition and Combination” in the MS-GAT. MD and MC denote the MPSOs Derivation and the MPSOs Combination.

IV-B3 Space and Time Complexity

The size of each CARNet model for enhancing the Y, U, or V component is around 2.6 MB; the MS-GAT model size is about 1.8 MB for each individual color component.

Runtime analysis is given in Table IV. Note that since both the MS-GAT and the CARNet run Python code on a GPU while the C/C++ implementation of the G-PCC runs on a CPU, the collected runtime serves only as an intuitive reference for the computational complexity.

As seen, the averaged CARNet runtime for enhancing the Y, U, and V attributes is about 1.67 seconds, which is in the same order of magnitude as the G-PCC compression. The MPSOs derivation consumes about 80% of this time due to the use of stacked DNNs. The MS-GAT costs on average 51.97, 50.82, and 50.25 seconds for Y, U, and V processing, respectively, on the ten testing point clouds. Additionally, it requires 3.66 seconds for partition and combination to enforce block patch-based processing. Apparently, our CARNet speeds up over the MS-GAT by about 30× in the current implementation. In the future, block parallelism may be applied to accelerate the MS-GAT.

IV-B4 Discussion

As seen, the proposed CARNet effectively alleviates attribute artifacts in G-PCC compressed PCAs. It not only improves upon the G-PCC anchor but also outperforms the state-of-the-art MS-GAT by a large margin. We believe the performance gains introduced by the CARNet mostly come from the effective modeling of spatial variations in a local neighborhood to derive potential distortion approximations, and from the use of weighting coefficients to combine these approximations for optimal compensation. On the other hand, since global graph attention is used in the MS-GAT, it has to slice large-scale point clouds into smaller patches due to unaffordable memory consumption, which in turn easily falls into a local-optimization trap. Instead, the CARNet can process the entire large-scale point cloud for global optimization.

Moreover, the proposed CARNet attains about a 30× speedup over the MS-GAT when processing a compressed PCA. The high computational complexity of the MS-GAT is probably caused by the graph attention computation: even for a patch of 2048 points, the graph attention layer still requires noticeable computational resources.

Additionally, the MS-GAT is a post-processing method while the CARNet is an in-loop filtering solution that explicitly transmits weighting coefficients in bitstream. As a result, the CARNet can better adapt itself to dynamic point cloud characteristics through the use of different coefficients. This is verified in Table III where the CARNet consistently offers performance gains across all test point clouds.

IV-C Ablation Study

This section conducts a series of ablation studies to understand the contribution of each module in the CARNet. All models are trained using the colorized ShapeNet dataset by default. The comparison anchor is the latest G-PCC.

Two-Stream Network for MPSOs Derivation. We study the modular contribution of the two-stream network used for MPSOs derivation, and the results are reported in Table V. We consecutively disable the Frequency-Dependent Embedding stream (w/o FDE) and the Direct Spatial Embedding stream (w/o DSE) to measure the performance loss. As seen, in comparison to the default CARNet, performance losses appear when disabling either modular component: 4.2% and 3.2% BD-Rate increases are observed for "w/o FDE" and "w/o DSE", respectively, clearly showing the effectiveness of these two modules. The larger loss from disabling the Frequency-Dependent Embedding (i.e., w/o FDE) suggests that it is critical to model distortion artifacts across regions with different spatial frequencies.

Point Cloud w/o FDE w/o DSE Proposed
MVUB
9-bit
andrew -11.14% -14.57% -15.30%
david -14.91% -14.19% -18.35%
phil -16.65% -18.88% -21.52%
ricardo -13.76% -16.42% -19.52%
sarah -17.53% -18.12% -22.10%
8iVFB
10-bit
longdress -12.82% -12.82% -15.94%
loot -11.76% -14.07% -16.63%
redandblack -10.05% -11.13% -13.74%
queen -21.98% -19.50% -23.50%
soldier -15.44% -14.97% -17.13%
Owlii
11-bit
basketball_player -14.47% -16.52% -21.62%
exercise -14.07% -16.24% -20.41%
dancer -16.14% -17.16% -20.87%
model -13.80% -14.30% -17.54%
Average -14.60% -15.63% -18.87%
TABLE V: BD-Rate contribution of modular components in the two-stream network used for MPSOs derivation. Y BD-Rate is exemplified.
Point Cloud | w/o MPSOs Combination: BD-Rate | w/ MPSOs Combination, 1 coefficient: BD-Rate, bpp | 2 coefficients: BD-Rate, bpp | 3 coefficients: BD-Rate, bpp | 4 coefficients: BD-Rate, bpp
MVUB
9-bit
andrew -7.64% -14.03% 2.04E-05 -13.34% 3.80E-05 -15.3% 5.57E-05 -14.65% 7.30E-05
david +0.56% -16.96% 1.59E-05 -17.13% 3.40E-05 -18.35% 4.74E-05 -16.9% 6.30E-05
phil -5.34% -20.89% 1.51E-05 -20.76% 2.80E-05 -21.52% 4.27E-05 -20.07% 5.60E-05
ricardo +1.65% -17.16% 2.69E-05 -17.96% 4.90E-05 -19.52% 7.30E-05 -19.98% 1.00E-04
sarah -3.09% -20.28% 2.16E-05 -20.12% 3.50E-05 -22.1% 5.20E-05 -21.73% 7.46E-05
8iVFB
10-bit
longdress -5.88% -14.74% 7.46E-06 -14.82% 1.30E-05 -15.94% 2.10E-05 -15.44% 2.70E-05
loot -1.07% -15.76% 6.21E-06 -15.62% 1.24E-05 -16.63% 1.86E-05 -15.50% 2.48E-05
redandblack -2.09% -13.47% 2.00E-06 -13.79% 1.40E-05 -13.74% 2.17E-05 -14.22% 2.80E-05
queen -4.67% -22.33% 5.00E-06 -22.83% 1.01E-05 -23.50% 1.50E-05 -23.79% 1.99E-05
soldier -3.93% -15.79% 5.47E-06 -16.05% 1.00E-05 -17.13% 1.42E-05 -16.68% 2.00E-05
Owlii
11-bit
basketball_player +1.00% -19.99% 1.71E-06 -20.28% 3.42E-06 -21.62% 5.13E-06 -21.09% 6.84E-06
exercise +2.87% -18.86% 2.09E-06 -19.02% 4.18E-06 -20.41% 6.27E-06 -20.09% 8.36E-06
dancer -0.99% -21.46% 2.00E-06 -20.27% 4.00E-06 -20.87% 6.00E-06 -21.46% 8.00E-06
model -1.88% -17.82% 2.07E-06 -17.21% 4.08E-06 -17.54% 6.12E-06 -17.82% 8.16E-06
Average -2.18% -17.83% 1.47E-05 -17.76% 1.85E-05 -18.87% 2.75E-05 -18.49% 3.70E-05
TABLE VI: Efficiency of the MPSOs Combination. Y BD-Rate is measured for analysis. Bit rate overhead for compressing weighting coefficients is measured in bpp.

MPSOs Combination Mechanism. The MPSOs Combination using linear weighting is first examined by replacing it with a conventional single-channel output; as such, no weighting coefficient is encoded or transmitted. As shown in Table VI, the "w/o MPSOs Combination" option gains only 2.18% BD-Rate on the Y component, which is much lower than the default CARNet with MPSOs Combination (see Table II). More importantly, BD-Rate increases are observed for point clouds like "david", "ricardo", "basketball_player", and "exercise". All of this clearly reveals that the MPSOs Combination can well capture content dynamics for better generalization.

Moreover, we also investigate the enhancement performance of the CARNet using different numbers of linear weighting coefficients (or, equivalently, different numbers of MPSOs). Table VI reports the results of using 1, 2, 3, and 4 weighting coefficients. In general, using 3 or 4 coefficients yields comparable gains, better than using 1 or 2 coefficients; we hence adopt 3 weighting coefficients in this paper. The bit cost of the weighting coefficients is also reported: the average cost is negligible, implying almost zero impact on the overall bit rate of the compressed PCAs.

Cross-Component Strategy to Train Separate CARNet Models for Chroma Components. In Table VII, we report the results of training a joint CARNet model for the Y, U, and V components. Since the Y, U, and V components possess very different characteristics and distributions, a joint model may compromise performance to some extent. This is confirmed by our experiments: the BD-Rate gain of the joint Y/U/V model drops to 10.64% on average. In contrast, the proposed CARNet, which trains Y, U, and V separately with the cross-component strategy for the chroma attributes, achieves a 21.96% BD-Rate reduction.

Point Cloud Joint Model Separate Model
MVUB
9-bit
andrew -11.32% -18.36%
david -11.96% -23.38%
phil -15.5% -23.16%
ricardo -11.71% -22.32%
sarah -15.08% -25.54%
8iVFB
10-bit
longdress -9.05% -19.20%
loot -5.72% -20.15%
redandblack -8.35% -17.38%
queen -13.78% -31.61%
soldier -4.51% -19.25%
Owlii
11-bit
basketball_player -9.42% -21.19%
exercise -7.94% -21.82%
dancer -12.46% -21.74%
model -12.11% -22.27%
Average -10.64% -21.96%
TABLE VII: Joint YUV model versus Separate Y, U, V models. Cross-component strategy is applied to train the CARNet for U and V. BD-Rate is measured in YUV space.

G-PCC PredLift configuration. We further evaluate the CARNet when the PredLift mode is used for lossy G-PCC attribute compression. We apply exactly the same method as in Section IV-A to build the training dataset for deriving the PredLift models. The results are shown in Table VIII. As seen, the proposed CARNet consistently outperforms the G-PCC by 16.53% and the MS-GAT by 6.47% BD-Rate, further demonstrating its generalizability across compression settings.

Point Cloud G-PCC (TMC13v14) MS-GAT
MVUB
9-bit
andrew -10.63% -5.38%
david -14.4% -6.58%
phil -23.85% -11.5%
ricardo -16.42% -8.15%
sarah -19.32% -7.31%
8iVFB
10-bit
longdress -15.17% -2.69%
redandblack -12.59% -4.98%
soldier -14.92% -3.68%
Owlii
11-bit
dancer -20.56% -9.09%
model -17.45% -5.3%
Average -16.53% -6.47%
TABLE VIII: BD-Rate (Y) gain of the proposed CARNet compared with the G-PCC (TMC13v14) and MS-GAT [33] in PredLift mode

V Conclusion

The standardized G-PCC compactly represents the geometry coordinates and color attributes of point clouds with a significant reduction of data volume, at the expense of visually annoying compression artifacts. This work therefore develops a learning-based adaptive in-loop filter for G-PCC, termed CARNet, to effectively alleviate attribute artifacts in G-PCC compressed point clouds. The CARNet first leverages a two-stream network to generate a group of Most-Probable Sample Offsets (MPSOs) as compression distortion approximations and then linearly combines these MPSOs to best compensate the distortion. The linear weighting coefficients, estimated on the fly through least square error optimization guided by the original input attribute, ensure optimal distortion compensation. In our implementation, we construct the CARNet with sparse convolutions due to the superior performance of sparse tensors in representing unorganized, irregular points. Experimental results show that the proposed CARNet achieves 21.96% and 12.95% BD-Rate reductions against the G-PCC and the state-of-the-art MS-GAT, respectively. Ablation studies further provide evidence for the efficiency and generalization of the proposed method across various content and compression settings.

VI Acknowledgement

Our sincere gratitude goes to the authors of MS-GAT [33] for making their pretrained models publicly accessible, which greatly facilitated our comparative studies.

References

  • [1] G. Bjøntegaard (2001-04) Calculation of average PSNR differences between rd-curves. In ITU-T SG 16/Q6, 13th VCEG Meeting, Cited by: TABLE I, §IV-A.
  • [2] B. Bross, J. Chen, J. Ohm, G. J. Sullivan, and Y. Wang (2021) Developments in international video coding standardization after AVC, with an overview of versatile video coding (VVC). Proceedings of the IEEE 109 (9), pp. 1463–1493. External Links: Document Cited by: §I.
  • [3] C. Cao, M. Preda, and T. Zaharia (2019) 3D point cloud compression: A survey. In The 24th International Conference on 3D Web Technology, pp. 1–9. Cited by: §I.
  • [4] C. Cao, M. Preda, V. Zakharchenko, E. S. Jang, and T. Zaharia (2021) Compression of sparse and dense dynamic point clouds—methods and standards. Proceedings of the IEEE 109 (9), pp. 1537–1558. Cited by: §I.
  • [5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012. Cited by: §IV-A.
  • [6] C. Choy, J. Gwak, and S. Savarese (2019) 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3075–3084. Cited by: §IV-A.
  • [7] R. L. De Queiroz and P. A. Chou (2016) Compression of 3D point clouds using a region-adaptive hierarchical transform. IEEE Transactions on Image Processing 25 (8), pp. 3947–3956. Cited by: §I, §II-A.
  • [8] R. L. De Queiroz and P. A. Chou (2017) Transform coding for point clouds using a Gaussian process model. IEEE Transactions on Image Processing 26 (7), pp. 3507–3517. Cited by: §II-A.
  • [9] D. Ding, X. Gao, C. Tang, and Z. Ma (2021) Neural reference synthesis for inter frame coding. IEEE Transactions on Image Processing 31, pp. 773–787. Cited by: §I-A, §II-B.
  • [10] C. Dong, Y. Deng, C. C. Loy, and X. Tang (2015) Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE international conference on computer vision, pp. 576–584. Cited by: §II-B.
  • [11] G. Fang, Q. Hu, H. Wang, Y. Xu, and Y. Guo (2022) 3DAC: Learning attribute compression for point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14819–14828. Cited by: §II-A.
  • [12] C. Fu, E. Alshina, A. Alshin, Y. Huang, C. Chen, C. Tsai, C. Hsu, S. Lei, J. Park, and W. Han (2012) Sample adaptive offset in the HEVC standard. IEEE Transactions on Circuits and Systems for Video Technology 22 (12), pp. 1755–1764. Cited by: §I-B, §II-B.
  • [13] V.K. Goyal (2001) Theoretical foundations of transform coding. IEEE Signal Processing Magazine 18 (5), pp. 9–21. External Links: Document Cited by: §II-A.
  • [14] D. Graziosi, O. Nakagami, S. Kuma, A. Zaghetto, T. Suzuki, and A. Tabatabai (2020) An overview of ongoing point cloud compression standardization activities: video-based (V-PCC) and geometry-based (G-PCC). APSIPA Transactions on Signal and Information Processing 9, pp. e13. External Links: Document Cited by: §I, §III.
  • [15] S. Gu, J. Hou, H. Zeng, H. Yuan, and K. Ma (2019) 3D point cloud attribute compression using geometry-guided sparse representation. IEEE Transactions on Image Processing 29, pp. 796–808. Cited by: §II-A.
  • [16] Z. Guan, Q. Xing, M. Xu, R. Yang, T. Liu, and Z. Wang (2019) MFQE 2.0: a new approach for multi-frame quality enhancement on compressed video. IEEE transactions on pattern analysis and machine intelligence 43 (3), pp. 949–963. Cited by: §I-A, §II-B.
  • [17] N. Haala, M. Peter, J. Kremer, and G. Hunter (2008) Mobile LiDAR mapping for 3D point cloud collection in urban areas—a performance test. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 37, pp. 1119–1127. Cited by: §I.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §III-B.
  • [19] Y. He, X. Ren, D. Tang, Y. Zhang, X. Xue, and Y. Fu (2022) Density-preserving deep point cloud compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2333–2342. Cited by: §II-A.
  • [20] W. Jia, L. Li, Z. Li, X. Zhang, and S. Liu (2021) Residual-guided in-loop filter using convolution neural network. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 17 (4), pp. 1–19. Cited by: §II-B.
  • [21] M. Karczewicz, N. Hu, J. Taquet, C. Chen, K. Misra, K. Andersson, P. Yin, T. Lu, E. François, and J. Chen (2021) VVC in-loop filters. IEEE Transactions on Circuits and Systems for Video Technology 31 (10), pp. 3907–3925. External Links: Document Cited by: §I-A, §I-B, §II-B, footnote 1.
  • [22] W. Kim, W. Pu, A. Khairat, M. Siekmann, J. Sole, J. Chen, M. Karczewicz, T. Nguyen, and D. Marpe (2020) Cross-component prediction in HEVC. IEEE Transactions on Circuits and Systems for Video Technology 30 (6), pp. 1699–1708. External Links: Document Cited by: §III-A.
  • [23] L. Kong, D. Ding, F. Liu, D. Mukherjee, U. Joshi, and Y. Chen (2020) Guided cnn restoration with explicitly signaled linear combination. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 3379–3383. Cited by: §II-B.
  • [24] K. Lin, C. Jia, X. Zhang, S. Wang, S. Ma, and W. Gao (2022) NR-CNN: nested-residual guided CNN in-loop filtering for video coding. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 18 (4), pp. 1–22. Cited by: §II-B, §III-D.
  • [25] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §IV-A.
  • [26] D. Ma, F. Zhang, and D. R. Bull (2020) MFRNet: a new CNN architecture for post-processing and in-loop filtering. IEEE Journal of Selected Topics in Signal Processing 15 (2), pp. 378–387. Cited by: §I-A, §II-B.
  • [27] F. Nasiri, W. Hamidouche, L. Morin, N. Dhollande, and G. Cocherel (2021) A CNN-based prediction-aware quality enhancement framework for VVC. IEEE Open Journal of Signal Processing 2, pp. 466–483. Cited by: §I-A, §II-B.
  • [28] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera (2012) HEVC deblocking filter. IEEE Transactions on Circuits and Systems for Video Technology 22 (12), pp. 1746–1754. Cited by: §II-B.
  • [29] Z. Pan, X. Yi, Y. Zhang, B. Jeon, and S. Kwong (2020) Efficient in-loop filtering based on enhanced deep convolutional neural networks for HEVC. IEEE Transactions on Image Processing 29, pp. 5352–5366. Cited by: §II-B.
  • [30] M. Quach, J. Pang, D. Tian, G. Valenzise, and F. Dufaux (2022) Survey on deep learning-based point cloud compression. Frontiers in Signal Processing. Cited by: §II-A.
  • [31] S. Schwarz, M. Preda, V. Baroncini, M. Budagavi, P. Cesar, P. A. Chou, R. A. Cohen, M. Krivokuća, S. Lasserre, Z. Li, et al. (2018) Emerging MPEG standards for point cloud compression. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9 (1), pp. 133–148. Cited by: Fig. 1, §I-A, §I.
  • [32] X. Sheng, L. Li, D. Liu, Z. Xiong, Z. Li, and F. Wu (2021) Deep-PCAC: An end-to-end deep lossy compression framework for point cloud attributes. IEEE Transactions on Multimedia 24, pp. 2617–2632. Cited by: §II-A, §II-B, TABLE VIII.
  • [33] X. Sheng, L. Li, D. Liu, and Z. Xiong (2022) Attribute artifacts removal for geometry-based point cloud compression. IEEE Transactions on Image Processing 31, pp. 3399–3413. Cited by: 1st item, §I-A, §I-A, §IV-A, §IV-B1, TABLE III, TABLE IV, §VI, footnote 4.
  • [34] A. L. Souto and R. L. de Queiroz (2020) On predictive RAHT for dynamic point cloud coding. In IEEE International Conference on Image Processing (ICIP), pp. 2701–2705. Cited by: §II-A.
  • [35] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand (2012) Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22 (12), pp. 1649–1668. Cited by: §I.
  • [36] D. Thanh Nguyen and A. Kaup (2022) Learning-based lossless point cloud geometry coding using sparse representations. arXiv e-prints, pp. arXiv–2204. Cited by: 3rd item.
  • [37] C. Tsai, C. Chen, T. Yamakage, I. S. Chong, Y. Huang, C. Fu, T. Itoh, T. Watanabe, T. Chujoh, M. Karczewicz, et al. (2013) Adaptive loop filtering for video coding. IEEE Journal of Selected Topics in Signal Processing 7 (6), pp. 934–945. Cited by: §I-B, §II-B, footnote 1.
  • [38] D. Wang, S. Xia, W. Yang, and J. Liu (2021) Combining progressive rethinking and collaborative learning: a deep framework for in-loop filtering. IEEE Transactions on Image Processing 30, pp. 4198–4211. Cited by: §II-B.
  • [39] J. Wang, D. Ding, Z. Li, X. Feng, C. Cao, and Z. Ma (2021) Sparse tensor-based multiscale representation for point cloud geometry compression. arXiv preprint arXiv:2111.10633. Cited by: 3rd item.
  • [40] J. Wang, D. Ding, Z. Li, and Z. Ma (2021) Multiscale point cloud geometry compression. In 2021 Data Compression Conference (DCC), pp. 73–82. Cited by: 3rd item, §III-B.
  • [41] J. Wang and Z. Ma (2022) Sparse tensor-based point cloud attribute compression. In IEEE 5th International Conference on Multimedia Information Processing and Retrieval, Cited by: §II-A, §IV-A.
  • [42] Z. Wang, C. Ma, R. Liao, and Y. Ye (2021) Multi-density convolutional neural network for in-loop filter in video coding. In 2021 Data Compression Conference (DCC), pp. 23–32. Cited by: §II-B.
  • [43] C. Zhang, D. Florencio, and C. Loop (2014) Point cloud attribute compression with graph transform. In IEEE International Conference on Image Processing (ICIP), pp. 2066–2070. Cited by: §I-A, §II-A, §III.
  • [44] K. Zhang, J. Chen, L. Zhang, X. Li, and M. Karczewicz (2018) Enhanced cross-component linear model for chroma intra-prediction in video coding. IEEE Transactions on Image Processing 27 (8), pp. 3983–3997. Cited by: §III-D.
  • [45] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai (2018) Residual highway convolutional neural networks for in-loop filtering in HEVC. IEEE Transactions on Image Processing 27 (8), pp. 3827–3841. Cited by: §II-B.