3D Point Cloud Enhancement using Graph-Modelled Multiview Depth Measurements

02/11/2020 ∙ by Xue Zhang, et al.

A 3D point cloud is often synthesized from depth measurements collected by sensors at different viewpoints. The acquired measurements are typically both coarse in precision and corrupted by noise. To improve quality, previous works denoise a synthesized 3D point cloud a posteriori after projecting the imperfect depth data onto 3D space. Instead, we enhance depth measurements on the sensed images a priori, exploiting inherent 3D geometric correlation across views, before synthesizing a 3D point cloud from the improved measurements. By enhancing closer to the actual sensing process, we benefit from optimization targeting specifically the depth image formation model, before subsequent processing steps that can further obscure measurement errors. Mathematically, for each pixel row in a pair of rectified viewpoint depth images, we first construct a graph reflecting inter-pixel similarities via metric learning using data in previously enhanced rows. To optimize left and right viewpoint images simultaneously, we write a non-linear mapping function from the left pixel row to the right based on 3D geometry relations. We formulate a MAP optimization problem, which, after suitable linear approximations, results in an unconstrained convex and differentiable objective, solvable using the fast gradient method (FGM). Experimental results show that our method noticeably outperforms recent denoising algorithms that enhance after 3D point clouds are synthesized.




1 Introduction

A point cloud (PC) is a signal representation composed of discrete geometric samples of a physical object in 3D space, useful for a range of imaging applications such as immersive communication and virtual / augmented reality (AR/VR) [1, 2]. With recent advances in, and the ubiquity of, inexpensive active sensing technologies like Microsoft Kinect and Intel RealSense, one method to generate a PC is to deploy multiple depth sensors to capture depth measurements (in the form of images) of an object from different viewpoints, then project these measurements to 3D space to synthesize a PC [3, 4]. Limitations in the depth acquisition process mean that the acquired depth measurements suffer from both imprecision and additive noise. This results in a noisy synthesized PC, and previous works focus on denoising PCs using a variety of methods: low-rank prior, low-dimensional manifold model (LDMM), surface smoothness priors expressed as graph total variation (GTV), graph Laplacian regularizer (GLR), feature graph Laplacian regularizer (GFLR), etc. [5, 6, 7].

However, all the aforementioned denoising methods enhance a PC a posteriori, i.e., after a PC is synthesized from corrupted depth measurements. Recent work in image denoising [8, 9] has shown that by denoising raw sensed RGB measurements directly on the Bayer-patterned grid before demosaicking, contrast boosting and other steps typical in an image construction pipeline [10, 11] that obscure acquisition noise, one can dramatically improve the denoising performance compared to denoising on the image constructed after the pipeline (up to 15 dB in PSNR). Inspired by these works, we propose to enhance measurements in acquired depth images a priori, before projection to synthesize a PC. (We call our processing an “enhancement” because it performs joint denoising and dequantization based on our depth image formation model.) In our case, by enhancing closer to the actual physical sensing process before various steps in a PC synthesis pipeline including registration, stitching and filtering, we benefit from optimization that directly targets our depth-sensor-specific image formation model with a finite-bit pixel representation.

Specifically, towards a graph-smoothness signal prior [12, 13, 14], for each pixel row in a pair of rectified viewpoint depth images, we first construct a sparse graph reflecting inter-pixel similarities via metric learning [7] using data in previously enhanced rows. To exploit inter-view correlation and optimize left and right viewpoint images simultaneously, we write a non-linear mapping function from the left pixel row to the right based on 3D geometry relations. Using a depth image formation model that accounts for both additive noise and quantization, we formulate a maximum a posteriori (MAP) optimization problem, which, after suitable linear approximations, results in an unconstrained convex and differentiable objective, solvable using the fast gradient method (FGM) [15]. Experimental results show that by enhancing measurements at the depth image level, our method outperforms several recent PC denoising algorithms [16, 17, 18] in two commonly used PC error metrics [19].

Related Work: Previous work on depth image enhancement [20, 21, 22] typically enhances one depth map at a time using image-based signal priors. When given two (or more) viewpoint depth maps, by ignoring the inherent cross-correlation between the views and optimizing each separately, the resulting quality is sub-optimal. One exception is [23], which considers noiseless but quantized observations per pixel from two views as signals in specified quantization bins. To reconstruct, the most likely signal inside both sets of quantization bins is chosen. Our work differs from [23] in that our image formation model considers both additive noise and quantization, leading to a more challenging MAP problem involving likelihood and prior terms from both views. We address this using appropriate linear approximations and FGM.

2 System Overview

We assume a capturing system where the same 3D object is observed by two consumer-level depth cameras from different viewpoints, separated by distance . Specifically, there exist overlapping fields of view (FoV) from the two cameras, so that there are multiple observations of the same 2D object surface. See Fig. 1 for an illustration. Each depth camera returns as output a depth map of resolution and finite precision: each pixel is a noise-corrupted observation of the physical distance between the camera and the object, quantized to a -bit representation. Without access to the underlying hardware pipeline, we assume that the depth map is the “rawest” signal we can acquire from the sensor.

Figure 1: An example of the camera system.

For simplicity, we assume that the two captured depth maps are rectified; i.e., pixels in a row in the left view are capturing the same horizontal slice of the object as pixels in the same row in the right view. Rectification is a well-studied computer vision problem, and a known procedure [24] can be executed as a pre-processing step prior to our enhancement algorithm.

3 Problem Formulation

We first describe a depth-sensor-specific image formation model and a mapping from left-view pixels to right-view pixels. We next define likelihood term and signal prior for depth images. Finally, we formulate a MAP optimization problem to enhance multiview depth measurements.

3.1 Image Formation Model

Denote by () an observed depth pixel row in the left (right) view. See Fig. 1 for details. (We forego index in in the sequel for notational simplicity.) Observed pixels are noise-corrupted and quantized versions of the original depth measurements, and , respectively. Specifically, observation and true signal are related via the following formation model (in practice, quantization of depth measurements into a -bit representation in existing sensors is often non-uniform: larger depth distances are quantized into larger quantization bins; for simplicity, we model uniform quantization here, but the non-uniform generalization is straightforward):


where is a quantization parameter, and is a zero-mean additive noise. The same formation model applies for the right view. The goal is to optimally reconstruct signal given observation .
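As a rough illustration of the formation model described above, the following NumPy sketch corrupts a synthetic depth row with zero-mean Gaussian noise and then applies uniform quantization. The round-to-nearest-bin form, the per-pixel independent noise, the function name `observe_row` and all parameter names are assumptions made for illustration; the exact expression of the paper's model is not reproduced here.

```python
import numpy as np

def observe_row(x, q, noise_std, rng):
    """Simulate a noisy, uniformly quantized observation of depth row x.

    Assumed model (illustrative only): y = q * round((x + n) / q),
    with n ~ N(0, noise_std^2) drawn independently per pixel.
    """
    n = rng.normal(0.0, noise_std, size=x.shape)
    return q * np.round((x + n) / q)

rng = np.random.default_rng(0)
x_true = np.linspace(100.0, 120.0, 16)   # smooth synthetic depth row
y_obs = observe_row(x_true, q=4.0, noise_std=2.0, rng=rng)
```

Every entry of `y_obs` lies on the quantization grid, so the observation alone only localizes the noisy signal to a bin of width `q`; this is the ambiguity the enhancement has to resolve.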

3.2 View-to-view Mapping

Pixel rows of the rectified left and right views, and , are projections from the same 2D object surface onto two different camera planes, and thus are related. For simplicity, we assume that there is no occlusion when projecting a 3D object to the two camera views. We employ a known 1D warping procedure [25] to relate and . For the -th pixel in the left view, , its (non-integer) horizontal position in the right view after projection is


where is the disparity, and is the camera focal length. Note that is a function of both left pixel’s integer horizontal position and depth value .
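The projection of a left pixel into the right view can be sketched as follows, using the standard rectified-stereo relation in which disparity equals focal length times baseline divided by depth. The sign convention (subtracting the disparity from the left column index) and the names `project_to_right`, `focal_length` and `baseline` are assumptions for illustration and may differ from the paper's notation.

```python
import numpy as np

def project_to_right(depth_row, focal_length, baseline):
    """Return the (non-integer) right-view column of each left-view pixel.

    Assumed convention: disparity = focal_length * baseline / depth, and a
    left pixel at column i lands at column i - disparity in the right view.
    """
    depth_row = np.asarray(depth_row, dtype=float)
    cols = np.arange(depth_row.size, dtype=float)
    disparity = focal_length * baseline / depth_row
    return cols - disparity

# Two near pixels (depth 50) shift by 2 columns, two far pixels (depth 100) by 1.
pos = project_to_right([50.0, 50.0, 100.0, 100.0], focal_length=100.0, baseline=1.0)
```

The projected position depends on both the pixel's column and its depth value, which is what makes the left-to-right mapping non-linear in the depth signal.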

Assuming that the object surface is smooth, we interpolate right pixel row given left pixel row as


where is a real weight matrix. (3) states that is linearly interpolated from using weights , where is a function of signal . In particular, we model the weight between right pixel and left pixel as


In words, weight is larger if the distance between the projected position of left pixel and the target pixel position in is small. To simplify (4), we assume a constant . Combining with (2), (4) is rewritten as


Since is differentiable, we use the first-order Taylor series expansion around to get a linear approximation, where is the first estimate. Thus,


where is the Jacobian matrix (first-order partial derivatives) of at , and is a constant vector.
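The interpolation and its first-order linearization can be sketched numerically as below. This is only an illustration under stated assumptions: the weights use a Gaussian-shaped kernel of the distance between projected and target positions, rows are explicitly normalized (the paper instead assumes a constant normalization), and the Jacobian is obtained by central differences rather than analytically. The names `warp_row`, `numerical_jacobian`, `H` and `c` are hypothetical.

```python
import numpy as np

def warp_row(l, focal_length, baseline, sigma=1.0):
    """Interpolate a right-view row from left-view row l (illustrative).

    Left pixel j is projected to s_j = j - f*B/l_j; right pixel i is a
    normalized weighted average of left depths, with weights that decay
    with the distance between s_j and i.
    """
    l = np.asarray(l, dtype=float)
    idx = np.arange(l.size, dtype=float)
    s = idx - focal_length * baseline / l
    w = np.exp(-((idx[:, None] - s[None, :]) ** 2) / sigma**2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ l

def numerical_jacobian(fun, l0, eps=1e-6):
    """Central-difference Jacobian of fun at l0."""
    l0 = np.asarray(l0, dtype=float)
    cols = []
    for k in range(l0.size):
        step = np.zeros_like(l0)
        step[k] = eps
        cols.append((fun(l0 + step) - fun(l0 - step)) / (2 * eps))
    return np.stack(cols, axis=1)

g = lambda l: warp_row(l, focal_length=100.0, baseline=1.0)
l0 = np.linspace(80.0, 90.0, 8)      # first estimate of the left row
H = numerical_jacobian(g, l0)        # Jacobian at the first estimate
c = g(l0) - H @ l0                   # constant vector of the affine model
l_near = l0 + 0.05
approx_error = np.max(np.abs(g(l_near) - (H @ l_near + c)))
```

Near the first estimate the affine model `H @ l + c` tracks the non-linear mapping closely, which is what allows the right-view terms of the objective to be expressed in the left-view signal.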

3.3 Likelihood Term

We assume that the zero-mean additive noise follows a jointly Gaussian distribution; that is, the probability density function (pdf) of the noise is



where is a positive definite (PD) precision matrix, and is the noise variance. Given observation , the likelihood term is


where the region over which the integration (10) is performed is defined as


We assume that left and right noise and are independent.

Figure 2: Affine approximation of the Gaussian pdf.

The integration in (10) over a jointly Gaussian pdf is non-trivial. Instead, we first approximate over the region as an affine function


where constants and can be computed via Taylor series expansion at given (9). See the 1-D case in Fig. 2 for an illustration. For reasonably small , this is a good approximation. We now rewrite (10) as


where (14) is proven in the Appendix.
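The quality of the affine approximation can be checked in the 1-D case with the sketch below. It assumes the Taylor expansion is taken about the centre of the quantization bin, in which case the slope term integrates to zero over the symmetric bin and the approximate likelihood reduces to the bin width times the pdf at the centre. The function names and the chosen numbers are illustrative only.

```python
import math

def gaussian_pdf(n, sigma):
    """Zero-mean Gaussian pdf with standard deviation sigma."""
    return math.exp(-0.5 * (n / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def exact_bin_probability(center, width, sigma):
    """Integral of the Gaussian pdf over [center - width/2, center + width/2]."""
    cdf = lambda t: 0.5 * (1.0 + math.erf(t / (sigma * math.sqrt(2.0))))
    return cdf(center + width / 2.0) - cdf(center - width / 2.0)

def affine_bin_probability(center, width, sigma):
    """Integral of the first-order Taylor expansion of the pdf about `center`.

    The linear term cancels over the symmetric bin, leaving
    width * pdf(center).
    """
    return width * gaussian_pdf(center, sigma)

sigma = 7.0
for width in (0.5, 1.0, 4.0):
    exact = exact_bin_probability(3.0, width, sigma)
    approx = affine_bin_probability(3.0, width, sigma)
    print(f"width={width}: exact={exact:.6f}, affine={approx:.6f}")
```

The relative error grows roughly with the square of the bin width, consistent with the remark that the approximation is good for reasonably small bins.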

3.4 Signal Prior

As done in recent graph-based image processing work [26, 27, 28], we model the similarities among pixels in using a graph Laplacian matrix , and thus prior can be written as:


We assume that the previous pixel rows in the left depth image have been enhanced, and assuming in addition that the next row follows a similar image structure, can be learned from the previous rows. See Section 4 for details.
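To make the graph-smoothness prior concrete, the following sketch builds a combinatorial Laplacian for a short pixel row and evaluates the graph Laplacian regularizer, the quadratic form that the prior penalizes. The path-graph topology, the edge weights and the helper names are assumptions chosen for illustration; how the paper scales this quantity inside the prior is not shown here.

```python
import numpy as np

def laplacian_from_weights(W):
    """Combinatorial graph Laplacian L = D - W for a symmetric weight matrix."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def glr(x, L):
    """Graph Laplacian regularizer x^T L x = sum over edges of w_ij (x_i - x_j)^2."""
    return float(x @ L @ x)

# 4-pixel row on a path graph; the weak middle edge models a depth discontinuity.
W = np.zeros((4, 4))
for i, j, w in [(0, 1, 1.0), (1, 2, 0.1), (2, 3, 1.0)]:
    W[i, j] = W[j, i] = w
L = laplacian_from_weights(W)

piecewise_smooth = np.array([10.0, 10.0, 20.0, 20.0])
noisy = np.array([10.0, 12.0, 19.0, 21.0])
smooth_cost = glr(piecewise_smooth, L)
noisy_cost = glr(noisy, L)
```

A row that is constant across strongly connected pixels and jumps only across the weak edge incurs a lower cost than a noisy row, so a well-chosen graph favours piecewise-smooth depth without blurring edges.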

3.5 MAP Formulation

We now formulate a MAP problem for as follows.


where in (17) we substituted for , and in (18) we split up the first term since left and right noise, and , are independent.

To ease optimization, we minimize the negative log of (19):


(21) is an unconstrained convex and differentiable objective; we can solve for its minimum efficiently using FGM.
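As a minimal sketch of how a fast gradient method proceeds on an unconstrained, convex and differentiable objective, the code below applies Nesterov-style accelerated gradient descent to a simplified stand-in problem: a quadratic fidelity term plus a graph Laplacian regularizer. This stand-in is not the objective (21); the weight `mu`, the path graph and the function name `fast_gradient_method` are assumptions for illustration.

```python
import numpy as np

def fast_gradient_method(grad, x0, lipschitz, num_iters=500):
    """Accelerated gradient descent for a smooth convex objective.

    grad: callable returning the gradient; lipschitz: Lipschitz constant
    of the gradient, used as the inverse step size.
    """
    x = x0.copy()
    z = x0.copy()          # extrapolated point
    t = 1.0
    for _ in range(num_iters):
        x_next = z - grad(z) / lipschitz
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x

# Stand-in objective: f(x) = ||y - x||^2 + mu * x^T L x on a path graph.
n, mu = 8, 2.0
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1.0
y = np.array([10.0, 11.0, 9.0, 10.0, 20.0, 21.0, 19.0, 20.0])

grad = lambda x: 2.0 * (x - y) + 2.0 * mu * (L @ x)
lipschitz = 2.0 * (1.0 + mu * np.linalg.eigvalsh(L).max())
x_fgm = fast_gradient_method(grad, y.copy(), lipschitz)
x_closed = np.linalg.solve(np.eye(n) + mu * L, y)   # closed-form minimizer
```

For this quadratic stand-in a closed-form solution exists and serves as a check; an iterative method of this kind needs only gradient evaluations, which is convenient when the objective contains additional smooth terms.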

4 Feature Graph Learning

4.1 Learning Metric for Graph Construction

When pixel row of the left view is optimized, we assume that the previous rows, , have already been enhanced into . Using these enhanced rows, we compute graph Laplacian to define prior in (15). Because in practice , estimating reliably using only signal observations is a known difficult small-data learning problem. In particular, established graph learning algorithms such as graphical LASSO [29] and constrained ℓ1-norm minimization (CLIME) [30], which compute a sparse precision matrix using as input an accurate empirical covariance matrix estimated from a large number of observations, do not work in our small-data learning scenario.

Instead, inspired by [7] we construct an appropriate similarity graph via metric learning. We first assume that associated with each pixel (graph node) in is a length- relevant feature vector (to be discussed). The feature distance between two nodes and is computed using a real, symmetric and PD metric matrix as


Since is PD, for . The edge weight between nodes and is then computed using a Gaussian kernel:


To optimize , we minimize the graph Laplacian regularizer (GLR) evaluated using previous pixel rows:


where the edge weights in Laplacian are computed using features and of the -th observation and equations (22) and (23). To optimize in (24), [7] proposed a fast algorithm that optimizes the diagonal and off-diagonal entries of alternately. See [7] for details.
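The graph construction described in this subsection can be sketched as follows: a metric matrix defines a feature distance between two nodes, a Gaussian kernel turns distances into edge weights, and the regularizer summed over previously enhanced rows is the quantity to be minimized over the metric. The kernel form `exp(-d)`, the use of one shared feature matrix for all rows, and the function names are assumptions for illustration; the constraints on the metric matrix and the alternating optimization of [7] are omitted.

```python
import numpy as np

def feature_graph_laplacian(F, M):
    """Graph Laplacian from per-node features F (n x K) and metric M (K x K).

    Distance: d_ij = (f_i - f_j)^T M (f_i - f_j); edge weight: w_ij = exp(-d_ij).
    """
    diff = F[:, None, :] - F[None, :, :]
    d = np.einsum("ijk,kl,ijl->ij", diff, M, diff)
    W = np.exp(-d)
    np.fill_diagonal(W, 0.0)            # no self-loops
    return np.diag(W.sum(axis=1)) - W

def glr_objective(rows, F, M):
    """Sum of x^T L(M) x over previously enhanced rows."""
    L = feature_graph_laplacian(F, M)
    return sum(float(x @ L @ x) for x in rows)

# Three nodes with 2-D features; nodes 0 and 1 are close, node 2 is far away.
F = np.array([[0.0, 1.0], [0.1, 1.0], [2.0, 5.0]])
M = np.diag([1.0, 0.5])
L = feature_graph_laplacian(F, M)
cost = glr_objective([np.array([1.0, 1.0, 3.0])], F, M)
```

Nodes with similar features receive a weight close to one, while dissimilar nodes are nearly disconnected, so the learned metric decides which feature differences should break the smoothness assumption.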

4.2 Feature Selection for Metric Learning

To construct a feature vector for each pixel in , we first project the pixel to 3D space and compute its surface normal from its neighboring points using the method of [31]. Then together with depth value and location in the 2D grid, we construct . Because is symmetric, the number of matrix entries we need to estimate is only .

5 Experiments


Variance  Method    Metric  Adirondack  ArtL   Teddy  Recycle  Playtable
50        APSS      C2C     3.63        3.47   2.76   3.92     4.08
50        APSS      C2P     14.45       11.73  7.26   15.79    17.42
50        RIMLS     C2C     3.47        3.35   2.67   3.72     4.09
50        RIMLS     C2P     13.26       11.21  7.09   15.11    17.06
50        MRPCA     C2C     2.91        3.05   2.55   3.17     3.21
50        MRPCA     C2P     8.86        8.73   6.21   10.21    9.17
50        Proposed  C2C     2.11        2.26   1.56   2.45     3.09
50        Proposed  C2P     4.88        6.34   2.79   7.00     8.78
70        APSS      C2C     4.12        3.80   3.09   4.34     4.46
70        APSS      C2P     18.56       13.88  8.96   20.07    18.91
70        RIMLS     C2C     3.83        3.67   3.00   4.16     4.38
70        RIMLS     C2P     17.26       13.41  8.73   19.42    18.92
70        MRPCA     C2C     3.42        3.45   2.89   3.76     3.48
70        MRPCA     C2P     12.57       11.39  7.98   14.80    11.00
70        Proposed  C2C     2.32        2.48   1.68   2.68     3.26
70        Proposed  C2P     5.97        7.64   3.32   8.47     10.43
90        APSS      C2C     4.40        4.28   3.38   4.80     4.91
90        APSS      C2P     21.97       17.07  11.08  25.11    26.18
90        RIMLS     C2C     4.19        4.13   3.30   4.59     4.83
90        RIMLS     C2P     21.15       16.52  10.70  24.16    23.46
90        MRPCA     C2C     3.78        3.91   3.20   4.20     3.95
90        MRPCA     C2P     16.11       14.07  9.69   19.10    14.52
90        Proposed  C2C     2.47        2.70   1.84   2.92     3.45
90        Proposed  C2P     6.95        9.15   4.08   10.45    13.22


Table 1: C2C and C2P results of competing methods at three noise levels.

We conducted simulations with five depth image pairs provided in the Middlebury datasets [32]: Adirondack, Recycle, Playtable, Teddy and ArtL. By projecting left and right views to 3D space, the first three generate PCs with around 700000 points, Teddy with 337500 points and ArtL with 192238 points. Gaussian noise with zero mean and variance of 50, 70 and 90 is added to both left and right views, which are then quantized into 256 distinct values. To compute the precision matrix for noise in pixel row , we use previously estimated noise terms to compute the covariance matrix, where and . When learning the metric for graph construction, we consider previous pixel rows. To reduce computational complexity, the same optimized is used for the next pixel rows. Based on the feature vector in of the current row , we can finally compute the corresponding Laplacian .

Our proposed 3D PC enhancement method is compared against three existing PC denoising algorithms: APSS [16], RIMLS [17] and the moving robust principal component analysis (MRPCA) algorithm [18]. APSS and RIMLS are implemented with MeshLab software, and the source code of MRPCA is provided by the authors. Two commonly used PC evaluation metrics, point-to-point (C2C) error and point-to-plane (C2P) error between the ground truth and denoised point sets, are employed.
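The two error metrics can be sketched as below. This is one common, simplified variant: errors are measured in one direction only (from the evaluated set to the reference set), C2C averages Euclidean nearest-neighbour distances, and C2P averages squared distances along the reference normal. The exact definitions used in the experiments (for example symmetrization or squaring) follow [19] and may differ; the function names and the brute-force neighbour search are assumptions for illustration.

```python
import numpy as np

def nearest_indices(src, ref):
    """Index of the nearest reference point for each source point (brute force)."""
    d2 = ((src[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def c2c_error(src, ref):
    """Mean Euclidean distance from each source point to its nearest reference point."""
    nn = ref[nearest_indices(src, ref)]
    return float(np.linalg.norm(src - nn, axis=1).mean())

def c2p_error(src, ref, ref_normals):
    """Mean squared distance to the tangent plane at the nearest reference point."""
    idx = nearest_indices(src, ref)
    offsets = src - ref[idx]
    along_normal = np.einsum("ij,ij->i", offsets, ref_normals[idx])
    return float((along_normal ** 2).mean())

# Reference: 3 x 3 grid on the plane z = 0 with upward unit normals.
ref = np.array([[i, j, 0.0] for i in range(3) for j in range(3)], dtype=float)
normals = np.tile([0.0, 0.0, 1.0], (9, 1))
src = ref + np.array([0.1, 0.0, 0.2])   # shifted copy standing in for a denoised PC

c2c = c2c_error(src, ref)
c2p = c2p_error(src, ref, normals)
```

In this toy case the in-plane shift contributes to C2C but not to C2P, which illustrates why the two metrics are reported together: C2P ignores displacement along the surface.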

After projecting both noise-corrupted and quantized left and right views into a PC, one can apply the three aforementioned PC denoising algorithms. C2C and C2P results of the different methods at three noise levels are shown in Table 1. Overall, our method achieves the best performance in both metrics at all three noise levels, with C2C reduced by 0.68, 0.92 and 1.13, and C2P reduced by 2.68, 4.38 and 5.93 on average compared to the second-best algorithm for noise variances of 50, 70 and 90, respectively.
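The average gains quoted above can be recomputed from the Table 1 entries with the short sketch below. It assumes that the first row listed for each method is C2C and the second is C2P, and that MRPCA is the second-best algorithm on average (it has the lowest errors among the three baselines in every column); the dictionary layout and function name are illustrative.

```python
import numpy as np

# Per-scene errors copied from Table 1, ordered as
# Adirondack, ArtL, Teddy, Recycle, Playtable.
mrpca = {
    50: {"C2C": [2.91, 3.05, 2.55, 3.17, 3.21], "C2P": [8.86, 8.73, 6.21, 10.21, 9.17]},
    70: {"C2C": [3.42, 3.45, 2.89, 3.76, 3.48], "C2P": [12.57, 11.39, 7.98, 14.80, 11.00]},
    90: {"C2C": [3.78, 3.91, 3.20, 4.20, 3.95], "C2P": [16.11, 14.07, 9.69, 19.10, 14.52]},
}
proposed = {
    50: {"C2C": [2.11, 2.26, 1.56, 2.45, 3.09], "C2P": [4.88, 6.34, 2.79, 7.00, 8.78]},
    70: {"C2C": [2.32, 2.48, 1.68, 2.68, 3.26], "C2P": [5.97, 7.64, 3.32, 8.47, 10.43]},
    90: {"C2C": [2.47, 2.70, 1.84, 2.92, 3.45], "C2P": [6.95, 9.15, 4.08, 10.45, 13.22]},
}

def average_gain(variance, metric):
    """Mean error of MRPCA minus mean error of the proposed method."""
    return float(np.mean(mrpca[variance][metric]) - np.mean(proposed[variance][metric]))

gains = {(v, m): round(average_gain(v, m), 2)
         for v in (50, 70, 90) for m in ("C2C", "C2P")}
```

Under these assumptions the rounded gains agree with the figures stated in the text, which also confirms the C2C/C2P row ordering of the table.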

Visual results for Recycle are shown in Fig. 3. For better visualization, we use CloudCompare software to show the C2C absolute distances between the ground truth points and their closest denoised points. We observe that our proposed method achieves smaller C2C errors (in blue) compared to the competitors.

Figure 3: Comparison of visual results for Recycle when the noise variance is 50. From blue to red, C2C absolute errors gradually become larger. The result of the proposed method contains noticeably more blue points.

6 Conclusion

Point clouds are typically synthesized from finite-precision depth measurements that are noise-corrupted. In this paper, we improve the quality of a synthesized point cloud by jointly enhancing multiview depth images—the “rawest” signal we can acquire from an off-the-shelf sensor—prior to modules in a typical point cloud synthesis pipeline that obscure acquisition noise. We formulate a graph-based MAP optimization that specifically targets an image formation model accounting for both additive noise and quantization. Simulation results show that our proposed scheme outperforms competing schemes that denoise point clouds after the synthesis pipeline.

Appendix A Proof of Multiple Integral

We prove (14) by induction. Consider first the base case () when and , where . Integral in (10) in this case is a single integral, and one can easily check that . Consider next the inductive case and assume , when . If the dimension of the signal is actually , when integrating the first variables, the -th term is treated the same as constant , thus,

where and are vectors for only the first terms. Since is constant, like the base case one can easily integrate this, resulting in

where and are vectors for all terms.


  • [1] M. Wien, J. M Boyce, T. Stockhammer, and W.-H. Peng, “Standardization status of immersive video coding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 1, pp. 5–17, 2019.
  • [2] M. L. Steven, Virtual Reality, Cambridge University Press: Cambridge, UK, 2016.
  • [3] R. Hartley and A. Zisserman, Multiple view geometry in computer vision, Cambridge university press, 2003.
  • [4] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang, “The apolloscape dataset for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 954–960.
  • [5] C. Dinesh, G. Cheung, I. V Bajić, and C. Yang, “Local 3D point cloud denoising via bipartite graph approximation & total variation,” in 2018 IEEE International Workshop on Multimedia Signal Processing. IEEE, 2018, pp. 1–6.
  • [6] J. Zeng, G. Cheung, M. Ng, J. Pang, and C. Yang, “3D point cloud denoising using graph laplacian regularization of a low dimensional manifold model,” IEEE Transactions on Image Processing, vol. 29, pp. 3474–3489, 2019.
  • [7] W. Hu, X. Gao, G. Cheung, and Z. Guo, “Feature graph learning for 3D point cloud denoising,” in arXiv preprint arXiv: 1907.09138, 2019.
  • [8] A. Punnappurath and M. S Brown, “Learning raw image reconstruction-aware deep image compressors,” IEEE transactions on pattern analysis and machine intelligence, Early Access, 2019.
  • [9] R. M. Nguyen and M. S Brown, “Raw image reconstruction using a self-contained sRGB-JPEG image with only 64 KB overhead,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1655–1663.
  • [10] S. Farsiu, M. Elad, and P. Milanfar, “Multiframe demosaicing and super-resolution of color images,” IEEE transactions on image processing, vol. 15, no. 1, pp. 141–159, 2005.
  • [11] D. Sahu, A. Bhargava, and P. Badal, “Contrast image enhancement using various approaches: A review,” Journal of Image Processing & Pattern Recognition Progress, vol. 4, no. 3, pp. 39–45, 2017.
  • [12] A. Ortega, P. Frossard, J. Kovacevic, J. M. F. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges, and applications,” in Proceedings of the IEEE, May 2018, vol. 106, no.5, pp. 808–828.
  • [13] G. Cheung, E. Magli, Y. Tanaka, and M. Ng, “Graph spectral image processing,” in Proceedings of the IEEE, May 2018, vol. 106, no.5, pp. 907–930.
  • [14] J. Pang and G. Cheung, “Graph Laplacian regularization for inverse imaging: Analysis in the continuous domain,” in IEEE Transactions on Image Processing, April 2017, vol. 26, no.4, pp. 1770–1785.
  • [15] Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.
  • [16] G. Guennebaud and M. Gross, “Algebraic point set surfaces,” in ACM SIGGRAPH 2007 papers, pp. 23–es. 2007.
  • [17] A. C. Öztireli, G. Guennebaud, and M. Gross, “Feature preserving point set surfaces based on non-linear kernel regression,” in Computer Graphics Forum. Wiley Online Library, 2009, vol. 28, pp. 493–501.
  • [18] E. Mattei and A. Castrodad, “Point cloud denoising via moving rpca,” in Computer Graphics Forum. Wiley Online Library, 2017, vol. 36, pp. 123–137.
  • [19] D. Tian, H. Ochimizu, C. Feng, R. Cohen, and A. Vetro, “Geometric distortion metrics for point cloud compression,” in 2017 IEEE International Conference on Image Processing. IEEE, 2017, pp. 3460–3464.
  • [20] W. Hu, G. Cheung, and M. Kazui, “Graph-based dequantization of block-compressed piecewise smooth images,” in IEEE Signal Processing Letters, February 2016, vol. 23, no.2, pp. 242–246.
  • [21] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang, “Learning dynamic guidance for depth image enhancement,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3769–3778.
  • [22] J. Jeon and S. Lee, “Reconstruction-based pairwise depth dataset for depth image enhancement using CNN,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 422–438.
  • [23] P. Wan, G. Cheung, P. Chou, D. Florencio, C. Zhang, and O. Au, “Precision enhancement of 3D surfaces from compressed multiview depth maps,” in IEEE Signal Processing Letters, October 2015, vol. 22, no.10, pp. 1676–1680.
  • [24] C. Loop and Z. Zhang, “Computing rectifying homographies for stereo vision,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149). IEEE, 1999, vol. 1, pp. 125–131.
  • [25] J. Jin, A. Wang, Y. Zhao, C. Lin, and B. Zeng, “Region-aware 3-D warping for DIBR,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 953–966, 2016.
  • [26] V. Kalofolias, “How to learn a graph from smooth signals,” in Artificial Intelligence and Statistics, 2016, pp. 920–929.
  • [27] H. E Egilmez, E. Pavez, and A. Ortega, “Graph learning from data under laplacian and structural constraints,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 825–841, 2017.
  • [28] Y. Bai, G. Cheung, X. Liu, and W. Gao, “Graph-based blind image deblurring from a single photograph,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1404–1418, 2018.
  • [29] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” in Biostatistics, 2008, vol. 9, no.3, pp. 432–441.
  • [30] T. Cai, W. Liu, and X. Luo, “A constrained minimization approach to sparse precision matrix estimation,” in Journal of the American Statistical Association, 2011, vol. 106, pp. 594–607.
  • [31] H. Avron, A. Sharf, C. Greif, and D. Cohen-Or, “ℓ1-sparse reconstruction of sharp point set surfaces,” ACM Transactions on Graphics (TOG), vol. 29, no. 5, pp. 1–12, 2010.
  • [32] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling, “High-resolution stereo datasets with subpixel-accurate ground truth,” in German conference on pattern recognition. Springer, 2014, pp. 31–42.