I Introduction
Dynamic 3D scenes such as humans in motion can now be captured by arrays of color plus depth (or ‘RGBD’) video cameras [1], and such data is getting very popular in emerging applications such as animation, gaming, virtual reality, and immersive communications. The geometry captured by RGBD camera arrays, unlike computergenerated geometry, has little explicit spatiotemporal structure, and is often represented by sequences of colored point clouds. Frames, which are the point clouds captured at a given time instant as shown in Fig. 1, may have different numbers of points, and there is no explicit association between points over time. Performing motion estimation, motion compensation, and effective compression of such data is therefore a challenging task.
In this paper, we focus on the compression of the 3D geometry and color attributes and propose a novel motion estimation and compensation scheme that exploits temporal correlation in sequences of point clouds. To deal with the large size of these sequences, we consider that the point clouds are voxalized, that is, their 3D positions are quantized to a regular, axisaligned, 3D grid having a given stepsize. This quantization of the space is commonly achieved by modeling the 3D point cloud sequences as a series of octree data structures [1], [2], [3]
. In contrast to polygonal mesh representations, the octree structure exploits the spatial organization of the 3D points, which results in easy manipulations and allows realtime processing. In more details, an octree is a tree structure with a predefined depth, where every branch node represents a certain cube volume in the 3D space, which is called a voxel. A voxel containing a point is said to be occupied. Although the overall voxel set lies in a regular grid, the set of occupied voxels are nonuniformly distributed in space. To uncover the irregular structure of the occupied voxels inside each frame, we consider points as vertices in a graph
, with edges between nearby vertices. Attributes of each point , including 3D position and color components , are treated as signals residing on the vertices of the graph. As frames in the 3D point cloud sequences are correlated, the graph signals at consecutive time instants are also correlated. Hence, removing temporal correlation implies comparing the behavior of the signals residing on the vertices of consecutive graphs. The estimation of the correlation is however a challenging task as the graphs usually have different number of nodes and no explicit correspondence information between the nodes is available in the sequence.We build on our previous work [4], and propose a novel algorithm for motion estimation and compensation in 3D point cloud sequences. We cast motion estimation as a feature matching problem on dynamic graphs. In particular, we compute new local features at different scales with spectral graph wavelets (SGW) [5]
for each node of the graph. Our feature descriptors, which consist of the wavelet coefficients of each of the signals placed in the corresponding vertex, are then used to compute pointtopoint correspondences between graphs of different frames. We match our SGW features in different graphs with a criterion that is based on the Mahalanobis distance and trained from the data. To avoid inaccurate matches, we first compute the motion on a sparse set of matching nodes that satisfy the matching criterion. We then interpolate the motion of the other nodes of the graph by solving a new graphbased quadratic regularization problem, which promotes smoothness of the motion vectors on the graph in order to build a consistent motion field.
Then, we design a compression system for 3D point cloud sequences, where we exploit the estimated motion information in the predictive coding of the geometry and color information. The basic blocks of our compression architecture are shown in Fig. 2. We code the motion field in the graph Fourier domain by exploiting its smoothness on the graph. Temporal redundancy in consecutive 3D positions is removed by coding the structural difference between the target frame and the motion compensated reference frame. The structural difference is efficiently described in a binary stream format as described in [6]. Finally, we predict the color of the target frame by interpolating it from the color of the motion compensated reference frame. Only the difference between the actual color information and the result of the motion compensation is actually coded with a stateoftheart encoder for static octree data [7]. Experimental results illustrate that our motion estimation scheme is efficient as it can capture the motion between consecutive frames. Moreover, introducing motion compensation in compression of 3D point cloud sequences results in significant improvement in terms of ratedistortion performance of the overall system, and in particular in the compression of the color attributes where we achieve a gain of up to 10 dB in comparison to stateoftheart encoders. The contribution of the paper is summarized as follows. To the best of our knowledge, the proposed encoder is the first one in the existing literature that exploits motion estimation to remove the temporal redundancy for efficient coding of point cloud sequences, without going first through the expensive conversion process into a temporally consistent polygonal mesh. Second, we represent the point cloud sequences as a set of graphs and we solve the motion estimation problem as a feature matching problem in dynamic graphs. Third, we propose a differential coding scheme for color compression that provides significant gain in terms of coding performance.
The rest of the paper is organized as follows. First, in Section II, we review the existing work in the literature that studies the problem of compression of 3D point clouds. Next, in Section III, we describe the representation of 3D point clouds by performing an octree decomposition of the 3D space and we introduce graphs to capture the irregular structure of this representation. The motion estimation scheme is presented in Section IV. The estimated motion is then applied to the predictive coding of the geometry and the color in Section V. Finally, experimental results are given in Section VI.
Ii Related work
The direct compression of 3D point cloud sequences has been largely overlooked so far in the literature. A few works have been proposed to compress static 3D point clouds. Some examples include the 2D wavelet transform based scheme of [8], and the subdivision of the point cloud space in different resolution layers using a kdtree structure [9]. An efficient binary description of the spatial point cloud distribution is performed through a decomposition of the 3D space using octree data structures. The octree decomposition, in contrast to the mesh construction, is quite simple to obtain. It is the basic idea behind the geometry compression algorithms of [3], [2]. The octree structure is also adopted in [7], to compress point cloud attributes. The authors construct a graph for each branch of leaves at certain levels of the octree. The graph transform, which is equivalent to the KarhunenLoève transform, is then applied to decorrelate the color attributes that are treated as signals on the graph. The proposed algorithm has been shown to remove the spatial redundancy for compression of the 3D point cloud attributes, with significant improvement over traditional methods. However, all the above methods are designed mainly for static point clouds. In order to apply them to point cloud sequences, we need to consider each frame of the sequence independently, which is clearly suboptimal.
Temporal and spatial redundancy of point cloud sequences has been recently exploited in [6]. The authors compress the geometry by comparing the octree data structure of consecutive point clouds and encoding their structural difference. The proposed compression framework can handle general point cloud streams of arbitrary and varying size, with unknown correspondences. It enables detection and differential encoding of spatial changes within temporarily adjacent octree structures by modifying the octree data structure, without though computing the exact motion of the voxels. Motion estimation in point clouds sequences can be quite challenging due to the fact that pointtopoint correspondences between consecutive frames are not known. While there exists a huge amount of works in the literature that study the problem of motion estimation in video compression, these methods cannot be extended easily to graph settings. In classical video coding schemes, motion in 3D space is mainly considered as a set of displacements in the regular image plane. Pixelbased methods [10], such as block matching algorithms, or optical and scene flow algorithms, are designed for regular grids. Their generalization to the nonEuclidean, irregular graph domain though is not straightforward. Featurebased methods [11], such as interest point detection, have also been widely used for motion estimation in video compression. These features usually correspond to key points of images such as corners or sharp edges [12, 13, 14]. With an appropriate definition of features on graphs, these methods can be extended to graphs. To the best of our knowledge though, so far they are not adapted to estimate the motion on graphs, nor on point clouds. Someone could also apply classical 3D descriptors such as [15, 16, 17, 18, 19, 20] to define 3D features. However, these type of descriptors assume that the point cloud represents a surface, which is not necessarily the case of a graph. Moreover, most of them require the computation of the normals which can be complex to obtain in realtime scenarios. An overview of classical 3D descriptors can be found in [21].
For the sake of completeness, we should mention that 3D point clouds are often converted into polygonal meshes, which can be compressed with a large body of existing methods. In particular, there does exist literature for compressing dynamic 3D meshes with either fixed connectivity and known correspondences (e.g., [22, 23, 24, 25, 26]) or varying connectivity (e.g., [27, 28]). A different type of approach consists of the video based methods. The irregular 3D structure of the meshes is parametrized into a rectangular 2D domain, obtaining the so called geometry images [29] in the case of a single mesh and geometry videos [30] in the case of 3D mesh sequences. The mapping of the 3D mesh surface onto a 2D array, which can be done either by using only the 3D geometry information or both the geometry and the texture information [31], allows conventional video compression to be applied to the projected 2D videos. Within the same line of work, emphasis has been given to extending these types of algorithms to handling sequences of meshes with different numbers of vertices and exploiting temporal correlation between them. An example is the recent work in [32], which proposes a framework for compressing 3D human motion oriented geometry videos by constructing key frames that are able to reconstruct the whole motion sequence. Comparing to the meshbased compression algorithms, the advantage is that the mesh connectivity information does not need to be sent, and the complexity is reduced by performing the operations from the 3D to the 2D space. All the above mentioned works however require the conversion process of the point cloud into a mesh, which is usually computationally expensive, and it cannot be easily applied in realtime applications. Finally, marching cubes algorithm [33] can be used to extract a polygonal mesh in a fast way, but it requires a “filled” volume.
Iii Structural representation of 3D point clouds
3D point clouds usually have little explicit spatial structure. Someone can however organize the 3D space by converting the point cloud into an octree data structure [1], [2], [3]. In what follows, we recall the octree construction process, and introduce graphs as a tool for capturing the structure of the leaf nodes of the octree.
Iiia Octree representation of 3D point clouds
An octree is a tree structure with a predefined depth, where every branch node represents a certain cube volume in the 3D space, which is called a voxel. A voxel containing a sample from the 3D point cloud is said to be occupied. Initially, the 3D space is hierarchically partitioned into voxels whose total number depends on the number of subdivisions, i.e., the depth of the resulting tree structure. For a given depth, an octree is constructed by traversing the tree structure in depthfirst order. Starting from the root, each node can generate eight children voxels. At the maximum depth of the tree, all the points are mapped to leaf voxels. An example of the voxalization of a 3D model for different depth levels, or equivalently for different quantization stepsizes, is shown in Fig. 3.
In contrast to polygonal mesh representations, the octree structure is easy to obtain and effective in realtime applications. Thanks to the different depths of the tree, it allows a multiresolution representation of the data that leads to efficient data processing in many applications. In particular, this multiresolution representation permits a progressive compression of the 3D positions of the data, which is lossless within each representation level [6].
IiiB Graphbased representation of 3D point clouds
Although the overall voxel set lies on a regular grid, the set of occupied voxels is nonuniformly distributed in space, as most of the leaf voxels are unoccupied. In order to represent the irregular structure formed by the occupied voxels, we use a graphbased representation. Graphbased representations are flexible and well adapted to data that live on an irregular domain [34]. In particular, we represent the set of occupied voxels of the octree using a weighted and undirected graph , where and represent the vertex and edge sets of . Each of the nodes in corresponds to an occupied voxel, while each edge in connects neighboring, occupied voxels. We define the connectivity of the graph based on the  nearest neighbors (NN graph), which is widely used in the literature. We usually set to 26 as it corresponds to the maximum number of neighbors for a node that have a maximum distance of one step along any axis of the 3D space. Two vertices are thus considered to be neighbors if they are among the 26 nearest neighbors in the voxel grid. The matrix is a matrix of positive edge weights, with denoting the weight of an edge connecting vertices and . This weight captures the connectivity pattern of nearby occupied voxels and are chosen to be inversely proportional to the distances between voxels.
IiiC Graph Fourier transform of 3D point cloud attributes
After the graph is constructed, we consider the attributes of the 3D point cloud — the 3D coordinates and the color components — as signals that reside on the vertices of the graph
. A spectral representation of these signals can be obtained with the help of the Graph Fourier Transform (GFT). The GFT is defined through the graph Laplacian operator
, where is the diagonal degree matrix whose diagonal element is equal to the sum of the weights of all the edges incident to vertex [35]. The graph Laplacian is a real symmetric matrix that has a complete set of orthonormal eigenvectors with corresponding nonnegative eigenvalues. We here denote its eigenvectors by
, and the spectrum of eigenvalues by , where is the number of vertices of the graph. The multiplicity of the smaller eigenvalue indicates the number of connected components of the graph. The GFT of any graph signal is then defined aswhile the inverse graph Fourier transform is given by
The GFT is useful to have an effective representation of the data. Furthermore, it has been shown to be optimum for decorrelating a signal following the Gaussian Markov Random Field model with precision matrix [36]. The GFT will be used later to define features in point cloud frames, and for effective coding of the data on the graph.
Iv Motion estimation in 3D point cloud sequences
As the frames have irregular structures, we use a featurebased matching approach to find correspondences. We use the graph information and the signals residing on its vertices to define feature descriptors on each vertex. We first define simple octant indicator functions to capture the signal values in different orientations. We then characterize the local topological context of each of the point cloud signals in each of these orientations, by using spectral graph wavelets (SGW) computed on the color and geometry signals at different resolutions [5]. Our feature descriptors, which consist of the wavelet coefficients of each of the signals placed in the corresponding vertex, are then used to compute pointtopoint correspondences between graphs of different frames. We select a subset of best matching nodes to define a sparse set of motion vectors that describe the temporal correlation in the sequence. A dense motion field is eventually interpolated from the sparse set of motion vectors to obtain a complete mapping between two frames. The overall procedure is detailed below.
Iva Multiresolution features on graphs
We define features in each node by computing the variation of the signal values, i.e., geometry and color components, in different parts of its neighborhood. For each node belonging to the vertex set of a graph , i.e., , we first define the octant indicator function , given as follows for the first octant
where is the indicator function on , evaluated in a set of voxels given by specific 3D coordinates. The first octant indicator function is thus nonzero only in the entries corresponding to the voxels whose 3D position coordinates are bigger than the ones of node . We consider all possible combinations of coordinates, which results in a total of indicator functions for the eight octants around . These functions result in a clustering of the nodes of the graph, providing a notion of orientation of each node in the 3D space with respect to , which is clearly provided by the voxel grid.
We then compute features based on both geometry and color information, by treating their values independently in each orientation. In particular, for each node and each geometry and color component , where , we compute the spectral graph wavelet coefficients by considering independently the values of in each orientation with respect to node such that
(1) 
where , and , is a set of discrete scales. The function represents the spectral graph wavelet of scale placed at that particular node , and denotes the pointwise product. We recall that the spectral graph wavelets [5] are operatorvalued functions of the graph Laplacian defined as
The graph wavelets are determined by the choice of a generating kernel , which acts as a bandpass filter in the spectral domain, and a scaling kernel that acts as a lowpass filter. The scaling is defined in the spectral domain, i.e., the wavelet operator at scale is given by . Spectral graph wavelets are finally realized through localizing these operators via the impulse on a single vertex . The scaling function , which is analogous to the lowpass scaling function in classical wavelet analysis, captures the low frequency content of the signals [5]. It thus acts as a low pass filter and helps ensure stable recovery of the graph signal from the wavelet coefficients. The application of these wavelets to signals living on the graph results in a multiscale descriptor for each node. The descriptor characterizes the local topological context of the signals in the different neighborhoods defined by the octant indicator functions. We define the feature vector at node as the concatenation of the coefficients computed in (1) with wavelets at different scales, including the features obtained from the scaling function, i.e., . These features can be efficiently computed by approximating the spectral graph wavelets with Chebyshev polynomials, as described in [5]. Given this approximation, the wavelet coefficients at each scale can then be computed as a polynomial of applied to the graph signal . The latter can be performed in a way that accesses only through iterative matrixvector multiplications. The polynomial approximation can be particularly efficient when the graph is sparse, which is indeed the case of our NN graph. Moreover, this approximation avoids the need to compute the complete spectrum of the graph Laplacian matrix. Thus, the computational cost of the features can be substantially reduced.
Finally, we note that spectral features have recently started to gain attention in the computer vision and shape analysis community. The heat kernel signatures
[37], their scaleinvariant version [38], the wave kernel signatures [39], the optimized spectral descriptors of [40], have already been used in 3D shape processing with applications in graph matching [41] or in mesh segmentation and surface alignment problems [42]. These features have been shown to be stable under small perturbations of the edge nodes of the graph. In all these works though, the descriptors are defined based only on the graph structure, and the information about attributes of the nodes such as color and 3D positions, if any, is assumed to be introduced in the weights of the graph. The performance of these descriptors depends on the quality of the defined graph. In contrast to this line of works, we define features by considering attributes as signals that reside on the vertices of a graph and characterize each vertex by computing the local evolution of these signals at different scales. Furthermore, this approach gives us the flexility to consider the signal values in different orientations as discussed above, and makes the descriptor of each node more informative.IvB Finding correspondences on dynamic graphs
We translate the problem of finding correspondences in two consecutive point clouds or frames of the sequence into finding correspondences on the vertices of their representative graphs. For the rest of this paper, we denote the sequence of frames as and the set of graphs corresponding to each frame as . For two consecutive frames of the sequence, , referred also as reference and target frame respectively, our goal is to find correspondences between the vertices of their representative graphs and . The number of vertices can differ between the graphs and is denoted as and respectively.
Given two graphs , , and their respective vertex sets , we use the features defined in the previous subsection to measure the similarity between vertices. We compute the matching score between two nodes as the Mahalanobis distance between the corresponding feature vectors, i.e.,
(2) 
where is a matrix that characterize the relationships between the geometry and the color feature components, which are measured in different units, as well as the contribution of each of the wavelet scales in the matching performance. We learn the positive definite matrix by estimating the sample inverse covariance matrix from a set of training features that are known to be in correspondence. If corresponds to , this models as a Gaussian random vector with mean and covariance , while if does not correspond to , this models as coming from a very flat (essentially uniform) distribution. Hence the matching score can be considered a log likelihood ratio for testing the hypothesis that corresponds to .
For each node in , we use the matching score to define its best matching node in . In particular, for each , we define as its best match in , the node with the minimum Mahalanobis distance, i.e.,
From the global set of correspondences computed for all the nodes of , we select a sparse set of significant matches. The objective of this selection is to take into consideration only accurate matches and ignore others since inaccurate correspondences are possible with our spectral descriptors in the case of large displacements. We also want to avoid matching points in that do not have any true correspondence in the preceding frame . The sparse set of matching nodes will later be used for interpolating the motion across all the nodes of the graph. For that reason, we need to ensure that we keep correspondences in all areas of the 3D space, and these correspondences should be spatially discriminative. We cluster the vertices of into different regions and we keep only one correspondence, i.e., one representative vertex, per region. Clustering is performed by applying means in the 3D coordinates of the nodes of the target frame, where is usually set to be equal to the target number of significant matches. In order to avoid inaccurate matches, a representative vertex per cluster is included in the sparse set only if its best score is smaller than a predefined threshold. This procedure results in detecting a sparse set of vertices in , denoted , and the set of their correspondences in , . Moreover, our sparse set of matching points tend to be accurate correspondences that are well distributed spatially.
IvC Computation of the motion vectors
We now describe how we generate a dense motion field from the sparse set of matching nodes . Our implicit assumption is that vertices that are close in terms of 3D positions, which means that they are close neighbors in the underlying graph, undergo a similar motion. We thus use the structure of the graph in order to interpolate the motion field, which is assumed to be smooth on the graph.
In more detail, our goal is to estimate the dense motion field , for all , using the correspondences . To determine for , we use the vector between the pair of matching points ,
(3) 
Here we recall that and are the 3D positions of the vertices of and , respectively. To determine for , we consider the motion field , like , to be a vectorvalued signal that lives on the vertices of . Then we smoothly interpolate the sparse set of motion vectors (3). The interpolation is performed by treating each component independently. Given the motion values on some of the vertices, we pose interpolation as a regularization problem that estimates the motion values on the rest of the vertices by requiring the motion signal to vary smoothly across vertices that are connected by an edge in the graph. Moreover, we allow some smoothing on the known entries. The reason for that is that the proposed matching scheme does not necessarily guarantee that the sparse set of correspondences, and the estimated motion vectors associated with them, are correct. To limit the effect of motion estimation inaccuracies, for each matching pair , we model the matching score in the local neighborhood of with a smooth signal approximation. Specifically, for each , we extend the definition (3) to all , i.e.,
Then, for each node that belongs to the twohop neighborhood of i.e., , we express as a function of the geometric distance of from , using a secondorder Taylor series expansion around . That is,
(4) 
For each , we take to be a discrete sampled version of a continuous function where the second order Taylor approximation is
Thus for each , we assume that the matching score with respect to nodes that are in the neighborhood of its best match can be well modeled by a quadratic approximation function. We estimate of this quadratic approximation as the normalized covariance matrix of the 3D offsets,
This is motivated by the fact that if , then
satisfies . Hence,
lies in an ellipsoid whose second moment is proportional to
. Though there are other ways for computing in (4), this momentmatching method is fast while guaranteeing that is positive semidefinite. Next, we use the covariance matrices of the 3D offsets to define a diagonal matrix , such thatwhere if for some , and otherwise. The matrix captures the second order Taylor approximation of the total match score as a function of the motion vectors in the neighborhoods of nodes in and is used to regularize the motion vectors of the known entries in as shown next.
Finally, we interpolate the dense set of motion vectors by taking into account the covariance of the motion vectors in the neighborhoods around the points that belongs to the sparse set and imposing smoothness of the dense motion vectors on the graph
(5) 
where is a selection matrix for each of the 3D components respectively, and is the Laplacian matrix of the graph . is the concatenation of the initial motion vectors, with , if . We note that the optimization problem consists of a fitting term that penalizes the excess matching score on the sparse set of matching nodes, and of a regularization term that imposes smoothness of the motion vectors in each of the position components independently. The tradeoff between the two terms is defined by the constant . A small promotes a solution that is closed to , while a big
favors a solution that is very smooth. Similar regularization techniques, which are based on the notion of smoothness of the graph Laplacian, have been widely used in the semisupervised learning literature
[43, 44]. The corresponding optimization problem is convex and it has a closed form solution given by(6) 
which can be computed iteratively using MINRESQLP [45] in large systems. With a slight abuse of notation, we will from now on denote as the reshaped motion vectors of dimensionality , where each row represents the motion in one of the three coordinates. Finally, denotes the 3D motion vector of node .
V Compression of 3D point cloud sequences
We describe now how the above motion estimation can be used to reduce temporal redundancy in compression of 3D point cloud sequences, by describing in detail each block of Fig. 2. We first code the motion vectors by transforming them to the graph Fourier domain. Coding of the 3D positions is then performed by comparing the structural difference between the target frame () and the motion compensated reference frame (). Temporal redundancy in color compression is finally exploited by encoding the difference between the target frame and the color prediction obtained from motion compensation.
Va Coding of motion vectors
We recall that, for each pair of two consecutive frames , the sparse set of motion vectors is initially smoothed at the encoder side. The estimated dense motion field is then transmitted to the decoder. One could transmit only the motion vectors on the sparse set of matching points and solve the interpolation problem (6) at the decoder. That would however increase the complexity of the decoder. We exploit the fact that the graph Fourier transform is suitable for compressing smooth signals [46], [36], by coding the motion vectors in the graph Fourier domain. In particular, since the motion is estimated in each of the nodes of the graph , we use the eigenvectors of the graph Laplacian operator corresponding to the graph of the reference frame, to transform the motion in each of the directions separately such as
The transformed coefficients are uniformly quantized as , where is the quantization stepsize that is constant across all the coefficients, and refers to the rounding operation. The choice of the quantization stepsize will be discussed in the experimental section. The quantized coefficients are then entropy coded independently with the adaptive runlength / golombrice (RLGR) entropy coder [47] and sent to the decoder. The decoder performs the reverse procedure to obtain the decoded motion vectors . Note that given that the decoder already knows the 3D positions of the reference frame, it can recover the NN graph. Thus, the connectivity of the graph does not have to be sent to the decoder. A block diagram of the encoder and the decoder is shown in Fig. 4.
VB Motion compensated differential coding of 3D geometries
From the current frame and its quantized motion vectors , both of which are signals on , it should be possible to predict the 3D positions of the points in the target frame , which is a signal on . Since the two graphs are of different sizes, a vector space prediction of from is not possible. One can however warp to in order to obtain a warped frame that is close to in some sense. Given that the 3D positions and the decoded motion vectors of are known to both the encoder and the decoder, the position of node in the warped frame can be estimated on both sides as
(7) 
Note that the warped frame remains a signal on the graph .
Given the warped frame , we use the realtime compression algorithm proposed in [6] to code the structural difference between the 3D positions of and . This algorithm essentially codes the set difference between the set of voxels occupied by the points of and the set of voxels occupied by the points of . Specifically, we assume that the point clouds corresponding to and have already been spatially decomposed into octree data structures at a predefined depth. At each level of the tree, the representation of each octree is done by capturing the occupancies of the children of an occupied voxel with a single byte, whose bits are set to one if the corresponding child is occupied and zero otherwise. Assuming a consistent order of the octants, each octree is then characterized by a bit stream, whose total size in bits is equal to eight times the number of internal nodes of the tree.
Given that both the encoder and decoder know the occupied voxels of the reference frame and the motion vectors , they are able to compute the occupied voxels of the motion compensated reference frame . The encoding of the occupied voxels of the target frame is performed by encoding the exclusiveOR (XOR) between the indicator functions for the occupied voxels in frames and . This can be implemented by an octree decomposition of the set of voxels that are occupied in but not in , or vice versa, as illustrated in Fig. 5(a). Thus, motion compensation is expected to reduce the set difference and hence the number of bits used by the octree decomposition. The decoder can eventually use the motion compensated previous frame and the bits from the octree decomposition to recover exactly the set of occupied voxels (and hence the graph and 3D positions) of the target frame . A schematic overview of the encoding and decoding architecture is shown in Fig. 5(b). A detailed description of the algorithm can be found in the original paper [6].
VC Motion compensated differential coding of color attributes
After coding the 3D positions and the motion vectors, motion compensation is used to predict the color of the target frame from the motion compensated reference frame. While the 3D positions of the points in the warped frame are based on the 3D positions of the previous frame and the motion field on the graph according to (7), the colors of the warped frame can be transferred directly from according to
Unfortunately, the graphs and have different sizes; as a consequence there is no direct correspondence between their nodes. However, since is obtained by warping to , we can use the colors of the points in to predict the colors of nearby points in . To be specific, for each , we compute a predicted color value by finding the nearest neighbors in terms of the Euclidean distance of the 3D positions and , and attributing to their average color i.e.,
where the cardinality of , i.e., , is usually set to 3.
Temporal redundancy in the color information is then removed by coding only the residual of the target frame with respect to the color prediction obtained with the above method, i.e., . The main blocks of the compression scheme are performed using the recently introduced graphbased compression algorithm of [7]. The algorithm is designed for compressing the 3D color attributes in static frames and it essentially removes the spatial correlation within each frame by coding each color component in the graph Fourier domain. The algorithm divides each octree in small blocks containing voxels. In each of these blocks, it constructs a graph and computes the graph Fourier transform as described in Section III. We adapt the algorithm to sequences by applying the graph Fourier transform to the color residual , and the residuals in each of the three color components are encoded separately. The graph Fourier coefficients are quantized uniformly.
The quantized coefficients are then entropy coded, where the structure of the graph is exploited for better efficiency. In particular, the coefficients corresponding to the zero eigenvalue of the graph Laplacian represent the DC term as they capture the average of each connected component of the graph. The rest of the coefficients capture higher oscillations of the residual on the graph and they represent the AC term. Due to their different behavior, the DC and the AC coefficients are treated differently during the entropy coding step. The AC coefficients are assumed to follow a continuous scaled Laplacian distribution, with a diversity parameter inversely proportional to the square root of the corresponding eigenvalue. This Laplacian distribution is used by a simple arithmetic encoder to encode the AC components. After encoding a new coefficient, the diversity parameter is then updated as defined in [7]
. The coding of the DC components is first performed by removing the mean of the previously decoded DC terms. The normalized DC coefficient is also assumed to follow a Laplacian distribution with probability function characterized by a diversity parameter that is inversely proportional to the number of connected voxels. More details about the color coding scheme are given in
[7] and a schematic overview is given in Fig. 6. Finally, we recall that while the algorithm was originally used for coding static frames, in this paper we use it for coding the residual of the target frame from the motion compensated reference frame. The algorithm however remains a valid choice as the statistics are adapted to the actual signal characteristics.Vi Experimental results
We illustrate in this section the matching performance obtained with our motion estimation scheme and the performance of the proposed compression scheme. We use two different sequences that capture human bodies in motion, i.e., the yellow dress and the man sequences, which have been voxalized to resemble data collected by the realtime high resolution sparse voxalization algorithm [1]. The first sequence consists of 64 frames, and the second one of 30 frames. For each frame, we voxelize the point cloud to a voxel stepsize of 20, which generates a set of approximately 8500 occupied voxels out of a total of 75000 initial 3D points with color attributes. The number of voxels depends on the size of the actual frames. As an illustrative example, in this paper the chosen voxel stepsize corresponds to an octree depth of seven. However, our motion estimation and compression scheme can be applied to any other octree level, with similar performance. The graph based interpolation problem of (6) is solved using MINRESQLP [45].
Via Motion estimation
We first illustrate the performance of our motion estimation algorithm by studying its effect in motion compensation experiments. We select two consecutive frames for each sequence, namely the reference () and the target frame (). The graph for each frame is constructed as described in Section III. We define spectral graph wavelets of 4 scales on these graphs, and for computational efficiency, we approximate them with Chebyshev polynomials of degree 30 [5]. We select the number of representative feature points to be around 500, which corresponds to fewer than of the total occupied voxels, and we compute the sparse motion vectors on the corresponding nodes by spectral matching. We estimate the motion on the rest of the nodes by smoothing the motion vectors on the graph according to (5).
In Figs. 7(a), 7(d), we superimpose the reference and the target frame for the yellow dress and the man sequence accordingly in order to illustrate the motion involved between two consecutive frames. The key points used for spectral matching in each of the two frames are shown in Figs. 7(b), 7(e) and they are represented in red for the target and in green for the reference frame. For the sake of clarity, we highlight only some of the correspondences used for computing motion vectors. We observe that the sparse set of matching vertices are accurate and welldistributed in space for both sequences. Finally, in Figs. 7(c), 7(f), we superimpose the target frame and the voxel representation of the motion compensated reference frame. By comparing visually these two figures to 7(a), 7(d) respectively, we observe that in both cases the motion compensated reference frame is much closer to the target frame than the simple reference frame. The obtained results confirm that our algorithm is able to estimate accurately the motion.
ViB 3D geometry compression
We now study the benefits of motion estimation in the compression of geometry in 3D point cloud sequences. The compressed geometry information includes motion vectors, the 3D positions of the reference frame, and the structural difference of the target frame and the motion compensated reference frame captured by the XOR encoded information. We note that the compression is performed on the whole sequence. The frames of the sequences are coded sequentially in the following way. Only the first frame is coded independently, while all the other frames are coded by using as a reference frame the previously coded frame. We first code the motion vectors with the proposed coding scheme of Sec. VA. The motion signal in each of the three directions is coded separately. In Fig. 8 we show the advantage of transforming the motion vectors in the graph Fourier domain, in comparison to coding directly in the signal domain, for the man sequence. Different stepsizes for uniform quantization are used to obtain different coding rates, hence different accuracies of the motion vectors. The performance is measured in terms of the signaltoquantization noise ratio (SQNR) for a fixed number of bits per vertex. The SQNR is computed on pairs of frames. Each point in the rate distortion curve corresponds to the average over 64 frames. The results confirm that coding the motion vectors in the graph Fourier domain results in an efficient spatial decorrelation of the motion signals, which brings significant gain in terms of coding rate. Similar results hold for the yellow dress sequence, but we omit them due to the lack of space.
We study next the effect of motion compensation in the coding rate of the 3D positions. We recall that the coding of the geometry is lossless. There exists however a tradeoff between the overall coding rate of the geometry and the coding rate of the motion vectors as we illustrate next. In particular, we compare the motion compensated dual octree scheme as described in Sec. VB, to the dual octree scheme of [6], and the simple octree compression algorithm. In Fig. 9, we illustrate the coding rate of the geometry with respect to the coding rate of the motion vectors, measured in terms of the average number of bits per vertex (bpv) over all the frames, for each of the three competitive schemes. The coding rate of the geometry includes the coding rate of the motion vectors. In Fig. 9(a), the smallest coding rate of the geometry (3.3 bpv) for the man sequence is achieved for a coding rate of the motion vectors of only 0.1 bpv. The latter indicates that coarse quantization of the motion vectors is enough for an efficient geometry compression. A smaller number of bits per vertex however tends to penalize the effect of motion compensation, giving an overall coding rate that approaches the one of the dual octree compression scheme. Of course, a finer coding of the motion vectors increases the overhead in the total coding rate of the geometry. The corresponding numbers for the simple octree and the dual octree compression scheme are approximately 3.42 and 3.5 respectively. These results indicate that the temporal structure captured by the dual octree compression scheme is not sufficient to improve the coding rate with respect to the simple octree compression algorithm. Motion compensation is thus needed to remove the temporal correlation. However, the overall gain that we obtain is small and corresponds to and with respect to the simple octree and the dual octree compression algorithm respectively. Similar results are observed in Fig. 9(b) for the yellow dress sequence. In order to study the effect of the motion in the compression performance, we perform two different tests. In the first test, we compress the entire yellow dress sequence, which is a low motion sequence. In the second test, we sample the sequence by keeping only 10 frames that are characterized by higher motion between consecutive frames. We then compress the geometry for this new smaller sequence. In Fig. 9(b), we observe that when the motion is low, the motion compensated dual octree and the dual octree compression algorithms are much more efficient in coding the geometry in comparison to the simple octree compression algorithm. Moreover, the motion compensated dual octree scheme requires a slightly smaller number of bits per vertex (2.2 bpv), for a coding rate of the motion vectors of 0.1 bpv. The coding rate for the dual octree and the simple octree compression algorithm are respectively 2.24 and 2.6 bpv. On the other hand, the simple octree compression scheme outperforms the dual octree compression algorithm, in the higher motion sequence of 10 frames, with coding rates of 3 and 2.8 bpv respectively. The motion compensated dual octree compression algorithm can close the gap between these two methods by achieving a coding rate of 2.8 bpv. We note that this performance is achieved for an overhead of 0.15 bpv for coding the motion vectors. Due to this overhead, the performance of the simple octree and the motion compensated dual octree compression algorithm are relatively close. To summarize, the above results indicate that, although motion compensation can improve the overall geometry compression performance, this improvement is marginal.
ViC Color compression
In the next set of experiments, we use motion compensation for color prediction, as described in Section VC. That is, using the smoothed motion field, we warp the reference frame to the target frame , and predict the color of each point in as the average of the three nearest points in the warped frame . We fix the coding rate of the motion vectors to 0.1 bpv and, for the sake of comparison, we compute the signaltonoise ratio (SNR) after predicting the color in the following three different ways: (i) the colors of points in the target frame are predicted from their nearest neighbors in the warped frame , (ii) the colors of points in the target frame are predicted from their nearest neighbors in , and (iii) the colors of points in the target frame are predicted as the average color of all the points in . The SNR for frame is defined as , where we recall that and are the actual color and the color prediction respectively. The prediction error is measured by taking pairs of frames in the sequence and computing the average over all the pairs. The obtained values for the man sequence are (i) 13 dB, (ii) 10.5 dB, and (iii) 4 dB, while for the yellow dress sequence the corresponding values are (i) 17 dB, (ii) 15 dB, and (iii) 6.5 dB. We notice that for both sequences motion compensation can significantly reduce the prediction error, by obtaining an average gain in the color prediction of 2.5 dB and 810 dB with respect to predicting simply based on the color of the nearest neighbors in the reference frame, and the average color of the reference frame respectively.
We finally use the prediction obtained from our motion estimation and compensation scheme to build a full scheme for color compression, that is based on a prediction path of a series of frames. Compression of color attributes is obtained by coding the residual of the target frame with respect to the color prediction obtained with the scheme described in Section VC. In our experiments, we code the color in small blocks of voxels. We measure the PSNR obtained for different levels of the quantization stepsize in the coding of the color information, hence different coding rates, for both independent [7] and differential coding. The results for both datasets are shown in Fig. 10. Each point on the curve corresponds to the average PSNR of the R,G,B components across the first ten frames of each sequence, obtained for a quantization stepsize of respectively. We observe that at low bit rate (), differential coding provides a gain with respect to independent coding of approximately 10 dB for both sequences. On the other hand, at high bit rate, the difference between independent and differential coding tends to become smaller, as both methods can code the color quite accurately. We note that the gain in the coding performance is highly dependent on the length of the prediction path. As the number of predicted frames increases, the accumulated quantization error from the previously coded frames is expected to lead to a gradual PSNR degradation that is more significant at low bit rate. This can be mitigated by periodic insertion of reference frames, and by optimizing the number of predicted frames between consecutive reference frames.
ViD Discussion
Motion compensation is beneficial overall in the compression of 3D point cloud sequences. The main benefit though is observed in the coding of the color attributes, providing a gain of up to 10 dB with respect to coding each frame independently. The gain in the compression of the 3D geometry is only marginal due to the overhead for coding the motion vectors. Finally, from the experimental validation in both datasets, we observe that from the overall bit budget the most significant part is used for the compression of the geometry. On the other hand, a very coarse quantization of the motion vectors is sufficient to achieve an overall good compression rate. For example, for each vertex in the man sequence, we need 0.10.2 bits to code the motion vectors, 0.10.3 bits for the color residual, and 3.3 bits for the geometry compression. Similar observations hold for the second dataset. The proposed motion compensated geometry compression framework that is based on the differential coding of consecutive octree graph structures is the most expensive part of the overall compression system. For a particular depth of the tree, the compression is lossless. A lossy geometry compression scheme that reduces the coding rate seems to be an interesting future direction.
Vii Conclusions
In this paper, we have proposed a novel compression framework for 3D point cloud sequences that is based on exploiting temporal correlation between consecutive point clouds. We have first proposed an algorithm for motion estimation and compensation. The algorithm is based on the assumption that 3D models are representable by a sequence of weighted and undirected graphs and the geometry and the color of each model can be considered as graph signals residing on the vertices of the corresponding graphs. Correspondence between a sparse set of nodes in each graph is first determined by matching descriptors based on spectral features that are localized on the graph. The motion on the rest of the nodes is interpolated by exploiting the smoothness of the motion vectors on the graph. Motion compensation is then used to perform geometry and color prediction. Finally, these predictions are used to differentially encode both the geometry and the color attributes. Experimental results have shown that the proposed method is efficient in estimating the motion and it eventually provides significant gain in the overall compression performance of the system.
Acknowledgements
The authors would like to thank Cha Zhang and Dinei Florêncio for providing the code for the color encoder and help in the color compression experiments.
References
 [1] C. Loop, C. Zhang, and Z. Zhang, “Realtime highresolution sparse voxelization with application to imagebased modeling,” in HighPerformance Graphics Conf., 2013, pp. 73–79.
 [2] Y. Huang, J. Peng, C. C. J. Kuo, and M. Gopi, “A generic scheme for progressive point cloud coding,” IEEE Trans. Vis. Comput. Graph., vol. 14, no. 2, pp. 440–453, 2008.
 [3] R. Schnabel and R. Klein, “Octreebased pointcloud compression,” in Symposium on PointBased Graphics, Jul. 2006.
 [4] D. Thanou, P. A. Chou, and P. Frossard, “Graphbased motion estimation and compensation for dynamic 3D point cloud compression,” in IEEE Int. Conf. on Image Process., Accepted for presentation, Sep. 2015.
 [5] D. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Appl. Comput. Harmon. Anal., vol. 30, no. 2, pp. 129–150, Mar. 2010.
 [6] J. Kammerl, N. Blodow, R. B. Rusu, S. Gedikli, M. Beetz, and E. Steinbach, “Realtime compression of point cloud streams,” in IEEE Int. Conf. on Robotics and Automation, May 2012.
 [7] C. Zhang, D. Florêncio, and C. Loop, “Point cloud attribute compression with graph transform,” in IEEE Int. Conf. on Image Process., Sep. 2014, pp. 2066 – 2070.
 [8] T. Ochotta and D. Saupe, “Compression of pointbased 3D models by shapeadaptive wavelet coding of multiheight fields,” in Eurographics Conf. on PointBased Graphics, 2004, pp. 103–112.
 [9] O. Devillers and PM. Gandoin, “Geometric compression for interactive transmission,” in IEEE Visualization, 2000, pp. 319–326.
 [10] M. Irani and P. Anandan, “About direct methods,” in Inter. Workshop on Vision Algorithms: Theory and Practice, 2000, pp. 267–277.
 [11] P. H. S. Torr and A. Zisserman, “Feature based methods for structure and motion estimation,” in Inter. Workshop on Vision Algorithms: Theory and Practice, 2000, pp. 278–294.
 [12] D. Lowe, “Distinctive image features from scaleinvariant keypoints,” Inter. Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.
 [13] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in European Conf. on Computer Vision, 2006, pp. 404–417.
 [14] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust widebaseline stereo from maximally stable extremal regions,” in British Machine Vision Conf., 2002, pp. 384–393.
 [15] P. Scovanner, S. Ali, and M. Shah, “A 3dimensional sift descriptor and its application to action recognition,” in Inter. Conf. on Multimedia, 2007, pp. 357–360.
 [16] F. Tombari, S. Salti, and L. di Stefano, “Unique signatures of histograms for local surface description,” in European Conf. on Computer Vision, Sept. 2010, pp. 356–369.
 [17] A. Zaharescu, E. Boyer, K. Varanasi, and R. Horaud, “Surface feature detection and description with applications to mesh matching,” in IEEE Conf. on Comp. Vision and Pattern Recogn., Jun. 2009, pp. 373–380.
 [18] F. Tombari, S. Salti, and L. di Stefano, “A combined textureshape descriptor for enhanced 3D feature matching,” in IEEE Int. Conf. on Image Process., 2011, pp. 809–812.
 [19] H. Chen and B. Bhanu, “3D freeform object recognition in range images using local surface patches,” Pattern Recogn. Lett., vol. 28, no. 10, pp. 1252–1262, Jul. 2007.
 [20] I. Sipiran and B. Bustos, “A robust 3D interest points detector based on harris operator,” in Eurographics Conf. on 3D Object Retrieval, 2010, pp. 7–14.
 [21] L. A. Alexandre, “3D descriptors for object and category recognition: a comparative evaluation,” in IEEE/RSJ Inter. Conf. on Intelligent Robots and Systems, 2012.
 [22] J. Peng, ChangSu Kim, and C. C. Jay Kuo, “Technologies for 3D mesh compression: A survey,” Journal of Vis. Comun. and Image Represent., vol. 16, no. 6, pp. 688–733, Dec. 2005.
 [23] J. Rossignac, “Edgebreaker: Connectivity compression for triangle meshes,” IEEE Trans. on Visualization and Computer Graphics, vol. 5, no. 1, pp. 47–61, Jan. 1999.
 [24] M. Alexa and W. Müller, “Representing animations by principal components,” Comput. Graph. Forum, vol. 19, no. 3, pp. 411–418, Sep. 2000.
 [25] L. Váša and V. Skala, “Geometry driven local neighborhood based predictors for dynamic mesh compression,” Comput. Graph. Forum, vol. 29, no. 6, pp. 1921–1933, 2010.
 [26] H. Q. Nguyen, P. A. Chou, and Y. Chen, “Compression of human body sequences using graph wavelet filter banks,” in IEEE Int. Conf. Acoustic, Speech, and Signal Process., May 2014, pp. 6152––6156.
 [27] S.R. Han, T. Yamasaki, and K. Aizawa, “Timevarying mesh compression using an extended block matching algorithm,” IEEE Trans. Circuits Syst. Video Techn., vol. 17, no. 11, pp. 1506–1518, Nov. 2007.
 [28] S. Gupta, K. Sengupta, and A. A. Kassim, “Registration and partitioningbased compression of 3D dynamic data,” IEEE Trans. Circuits Syst. Video Techn., vol. 13, no. 11, pp. 1144–1155, Nov. 2003.
 [29] X. Gu, S. J. Gortler., and H. Hoppe, “Geometry images,” in Annual Conf. on Computer Graphics and Interactive Techniques, 2002, pp. 355–361.
 [30] H. M. Briceno, P. V. Sander, L. McMillan, S. Gortler, and H. Hoppe, “Geometry Videos: A New Representation for 3D Animations ,” in ACM Symposium on Computer Animation, 2003, pp. 136–146.
 [31] H. Habe, Y. Katsura, and T. Matsuyama, “Skinoff: Representation and compression scheme for 3D video,” in Picture Coding Symp., 2004, pp. 301–306.
 [32] J. Hou, L. Chau, N. MagnenatThalmann, and Y. He, “Compressing 3D human motions via keyframebased geometry videos,” IEEE Trans. Circuits Syst. Video Techn., vol. 25, no. 1, pp. 51–62, 2015.
 [33] W. E. Lorensen and H. E. Cline, “Marching cubes: A high resolution 3D surface construction algorithm,” in 14th Annual Conf. on Comp. Graphics and Interactive Techniques, 1987, pp. 163–169.

[34]
D. I Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst,
“The emerging field of signal processing on graphs: Extending highdimensional data analysis to networks and other irregular domains,”
IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, May 2013.  [35] F. Chung, Spectral Graph Theory, American Mathematical Society, 1997.
 [36] C. Zhang and D. Florêncio, “Analyzing the optimality of predictive transform coding using graphbased models,” IEEE Signal Process. Lett., vol. 20, no. 1, pp. 106–109, 2013.
 [37] J. Sun, M. Ovsjanikov, and L. Guibas, “A concise and provably informative multiscale signature based on heat diffusion,” in Symposium on Geometry Process., 2009, pp. 1383–1392.
 [38] M. M. Bronstein and I. Kokkinos, “Scaleinvariant heat kernel signatures for nonrigid shape recognition,” in IEEE Conf. on Comp. Vision and Pattern Recogn., 2010, pp. 1704–1711.
 [39] M. Aubry, U. Schlickewei, and D. Cremers, “The wave kernel signature: A quantum mechanical approach to shape analysis,” in IEEE Int. Conf. Comp. Vision, 2011, pp. 1626–1633.
 [40] R. Litman and A. M. Bronstein, “Learning spectral descriptors for deformable shape correspondence,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 171–180, 2014.
 [41] N. Hu, R. M. Rustamov, and L. Guibas, “Stable and informative spectral signatures for graph matching,” in IEEE Conf. on Comp. Vision and Pattern Recogn., Jun. 2014.
 [42] W. H. Kim, M. K. Chung, and V. Singh, “Multiresolution shape analysis via noneuclidean wavelets: Applications to mesh segmentation and surface alignment problems,” in IEEE Conf. on Comp. Vision and Pattern Recogn., 2013, pp. 2139–2146.

[43]
X. Zhu, Z. Ghahramani, and J. D. Lafferty,
“Semisupervised learning using gaussian fields and harmonic
functions,”
in
Inter. Conf. on Machine Learning
, 2003, vol. 13, pp. 912–919.  [44] D. Zhou, O. Bousquet, T .N. Lal, J. Weston, and B. Scholkopf, “Learning with local and global consistency,” in Advances in Neural Information Process. Systems, 2004, vol. 16, pp. 321–328.
 [45] S.C. T. Choi and M. A. Saunders, “MINRESQLP for symmetric and hermitian linear equations and leastsquares problems,” ACM Trans. Math. Softw., vol. 40, no. 2, 2014.
 [46] X. Zhu and M. Rabbat, “Approximating signals supported on graphs,” in IEEE Int. Conf. Acoustic, Speech, and Signal Process., Jul. 2012, pp. 3921–3924.
 [47] H. S. Malvar, “Adaptive runlength / golombrice encoding of quantized generalized gaussian sources with unknown statistics,” in Data Compression Conf., Mar. 2006.
Comments
There are no comments yet.