1 Introduction
A point cloud (PC) is a signal representation composed of discrete geometric samples of a physical object in 3D space, useful for a range of imaging applications such as immersive communication and virtual/augmented reality (AR/VR) [1, 2]. With the recent advances and ubiquity of inexpensive active sensing technologies like Microsoft Kinect and Intel RealSense, one method to generate a PC is to deploy multiple depth sensors to capture depth measurements (in the form of images) of an object from different viewpoints, then project these measurements to 3D space to synthesize a PC [3, 4]. Limitations in the depth acquisition process mean that the acquired depth measurements suffer from both imprecision and additive noise. This results in a noisy synthesized PC, and previous works focus on denoising PCs using a variety of methods: low-rank priors, the low-dimensional manifold model (LDMM), surface smoothness priors expressed as graph total variation (GTV), the graph Laplacian regularizer (GLR), the feature graph Laplacian regularizer (GFLR), etc. [5, 6, 7].
However, all the aforementioned denoising methods enhance a PC a posteriori, i.e., after the PC is synthesized from corrupted depth measurements. Recent work in image denoising [8, 9] has shown that by denoising raw sensed RGB measurements directly on the Bayer-patterned grid, before demosaicking, contrast boosting and other steps typical in an image construction pipeline [10, 11] that obscure acquisition noise, one can dramatically improve denoising performance compared to denoising the image constructed after the pipeline (up to 15 dB in PSNR). Inspired by these works, we propose to enhance measurements in acquired depth images a priori, before projection to synthesize a PC. (We call our processing an "enhancement" because it performs joint denoising and dequantization based on our depth image formation model.) In our case, by enhancing closer to the actual physical sensing process, before the various steps in a PC synthesis pipeline including registration, stitching and filtering, we benefit from optimization that directly targets our depth-sensor-specific image formation model with a finite-bit pixel representation.
Specifically, towards a graph-smoothness signal prior [12, 13, 14], for each pixel row in a pair of rectified viewpoint depth images, we first construct a sparse graph reflecting inter-pixel similarities via metric learning [7], using data in previously enhanced rows. To exploit inter-view correlation and optimize the left and right viewpoint images simultaneously, we write a nonlinear mapping function from the left pixel row to the right based on 3D geometric relations. Using a depth image formation model that accounts for both additive noise and quantization, we formulate a maximum a posteriori (MAP) optimization problem, which, after suitable linear approximations, results in an unconstrained convex and differentiable objective, solvable using the fast gradient method (FGM) [15]. Experimental results show that by enhancing measurements at the depth image level, our method outperforms several recent PC denoising algorithms [16, 17, 18] in two commonly used PC error metrics [19].
Related Work: Previous works on depth image enhancement [20, 21, 22] typically enhance one depth map at a time using image-based signal priors. When given two (or more) viewpoint depth maps, ignoring the inherent cross-correlation between the views and optimizing each separately yields suboptimal quality. One exception is [23], which considers noiseless but quantized observations per pixel from two views as signals confined to specified quantization bins. To reconstruct, the most likely signal inside both sets of quantization bins is chosen. Our work differs from [23] in that our image formation model considers both additive noise and quantization, leading to a more challenging MAP problem involving likelihood and prior terms from both views. We address this using appropriate linear approximations and FGM.
2 System Overview
We assume a capturing system where the same 3D object is observed by two consumer-level depth cameras from different viewpoints, separated by a baseline distance. Specifically, the fields of view (FoV) of the two cameras overlap, so that there are multiple observations of the same 2D object surface. See Fig. 1 for an illustration. Each depth camera returns as output a depth map of fixed resolution and finite precision: each pixel is a noise-corrupted observation of the physical distance between the camera and the object, quantized to a finite-bit representation. Without access to the underlying hardware pipeline, we assume that the depth map is the "rawest" signal we can acquire from the sensor.
For simplicity, we assume that the two captured depth maps are rectified; i.e., pixels in a row in the left view capture the same horizontal slice of the object as pixels in the same row in the right view. Rectification is a well-studied computer vision problem, and a known procedure [24] can be executed as a preprocessing step prior to our enhancement algorithm.
3 Problem Formulation
We first describe a depth-sensor-specific image formation model and a mapping from left-view pixels to right-view pixels. We next define a likelihood term and a signal prior for depth images. Finally, we formulate a MAP optimization problem to enhance multiview depth measurements.
3.1 Image Formation Model
Denote by $\mathbf{y}^l \in \mathbb{R}^N$ ($\mathbf{y}^r$) an observed depth pixel row in the left (right) view. See Fig. 1 for details. (We omit the row index in the sequel for notational simplicity.) Observed pixels are noise-corrupted and quantized versions of the original depth measurements, $\mathbf{x}^l$ and $\mathbf{x}^r$, respectively. Specifically, observation $\mathbf{y}^l$ and true signal $\mathbf{x}^l$ are related via the following formation model (practical quantization of depth measurements into finite-bit representations in existing sensors is often non-uniform: larger depth distances are quantized into larger quantization bins; for simplicity, we model uniform quantization here, but the non-uniform generalization is straightforward):

$y^l_i = \delta \cdot \mathrm{round}\!\left(\frac{x^l_i + n^l_i}{\delta}\right)$ (1)

where $\delta$ is a quantization parameter (step size), and $\mathbf{n}^l$ is a zero-mean additive noise. The same formation model applies for the right view. The goal is to optimally reconstruct signal $\mathbf{x}^l$ given the observations.
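As a sanity check of the formation model (1), the acquisition of one depth row can be simulated in a few lines; the depth values, step size `delta`, and noise level `sigma` below are illustrative assumptions, not values from the paper's experiments.

```python
import random

def observe(x_row, delta, sigma, rng):
    """Simulate formation model (1): add zero-mean Gaussian noise,
    then quantize to the nearest multiple of step size delta."""
    return [round((x + rng.gauss(0.0, sigma)) / delta) * delta for x in x_row]

rng = random.Random(0)
true_row = [1000.0 + 2.0 * i for i in range(8)]   # hypothetical depths (mm)
obs_row = observe(true_row, delta=4.0, sigma=1.5, rng=rng)

# every observation lies exactly on the quantization grid
assert all(abs(y - round(y / 4.0) * 4.0) < 1e-9 for y in obs_row)
```

The observation is thus doubly corrupted: the noise perturbs the true depth, and the quantizer then snaps the noisy value to a grid, which is exactly what the likelihood term in Section 3.3 must account for.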
3.2 View-to-view Mapping
Pixel rows of the rectified left and right views, $\mathbf{x}^l$ and $\mathbf{x}^r$, are projections from the same 2D object surface onto two different camera planes, and thus are related. For simplicity, we assume that there is no occlusion when projecting the 3D object to the two camera views. We employ a known 1D warping procedure [25] to relate $\mathbf{x}^l$ and $\mathbf{x}^r$. For the $i$-th pixel in the left view, $x^l_i$, its (non-integer) horizontal position $s_i$ in the right view after projection is

$s_i = i - \frac{f b}{x^l_i}$ (2)

where $f b / x^l_i$ is the disparity, $f$ is the camera focal length, and $b$ is the baseline distance between the two views. Note that $s_i$ is a function of both the left pixel's integer horizontal position $i$ and its depth value $x^l_i$.
Assuming that the object surface is smooth, we interpolate the right pixel row $\mathbf{x}^r$ given the left pixel row $\mathbf{x}^l$ as

$\mathbf{x}^r = \mathbf{W}(\mathbf{x}^l)\, \mathbf{x}^l$ (3)

where $\mathbf{W}(\mathbf{x}^l) \in \mathbb{R}^{N \times N}$ is a real weight matrix. (3) states that $\mathbf{x}^r$ is linearly interpolated from $\mathbf{x}^l$ using weights $\mathbf{W}$, where $\mathbf{W}$ is a function of signal $\mathbf{x}^l$, since the projected positions $s_i$ in (2) depend on depth. In particular, we model the weight $W_{j,i}$ between right pixel $j$ and left pixel $i$ as a kernel that decays with the distance $|j - s_i|$ between integer position $j$ and projected position $s_i$.
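The mapping in (2)-(3) can be sketched as follows; the focal length, baseline, and the hat-shaped interpolation kernel are illustrative assumptions (the paper relies on the 1D warping procedure of [25]).

```python
def projected_positions(x_left, f, b):
    # eq. (2): right-view position of left pixel i, with disparity f*b/x_i
    return [i - f * b / x for i, x in enumerate(x_left)]

def interp_weight_matrix(x_left, f, b):
    """Hypothetical hat-kernel instance of W(x^l) in (3): right pixel j
    draws weight 1-|j-s_i| from each left pixel whose projection s_i
    lands within one pixel of j; rows are then renormalized."""
    n = len(x_left)
    s = projected_positions(x_left, f, b)
    W = [[max(0.0, 1.0 - abs(j - s[i])) for i in range(n)] for j in range(n)]
    for j in range(n):
        row_sum = sum(W[j])
        if row_sum > 0:
            W[j] = [w / row_sum for w in W[j]]
    return W

x_left = [100.0] * 6                 # fronto-parallel surface: constant depth
W = interp_weight_matrix(x_left, f=50.0, b=2.0)
x_right = [sum(wij * xi for wij, xi in zip(row, x_left)) for row in W]
# a constant-depth row maps to a constant-depth row wherever weights exist
assert all(abs(v - 100.0) < 1e-9 for v in x_right if v > 0)
```

For a fronto-parallel row every left pixel shifts by the same disparity (here exactly one pixel), so the interpolated right row reproduces the constant depth; a slanted surface would yield depth-dependent shifts and genuinely fractional positions $s_i$.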
3.3 Likelihood Term
We assume that the zero-mean additive noise $\mathbf{n}^l$ follows a jointly Gaussian distribution; i.e., the probability density function (pdf) of $\mathbf{n}^l \in \mathbb{R}^N$ is

$f(\mathbf{n}^l) = \frac{|\mathbf{P}|^{1/2}}{(2\pi\sigma^2)^{N/2}} \exp\!\left(-\frac{\mathbf{n}^{l\top} \mathbf{P}\, \mathbf{n}^l}{2\sigma^2}\right)$ (9)

where $\mathbf{P}$ is a positive definite (PD) precision matrix, and $\sigma^2$ is the noise variance. Given observation $\mathbf{y}^l$, the likelihood term is

$\Pr(\mathbf{y}^l \mid \mathbf{x}^l) = \int_{R(\mathbf{x}^l, \mathbf{y}^l)} f(\mathbf{n}^l)\, d\mathbf{n}^l$ (10)

where, given the formation model (1), the region over which the integration in (10) is performed is defined as

$R(\mathbf{x}^l, \mathbf{y}^l) = \left\{ \mathbf{n}^l \;\middle|\; y^l_i - \delta/2 \le x^l_i + n^l_i < y^l_i + \delta/2, \;\forall i \right\}$ (11)

We assume that the left and right noise terms, $\mathbf{n}^l$ and $\mathbf{n}^r$, are independent.
The integration in (10) of a jointly Gaussian pdf over region $R(\mathbf{x}^l, \mathbf{y}^l)$ is non-trivial. Instead, we first approximate $f$ over the region as an affine function

$f(\mathbf{n}^l) \approx \mathbf{a}^\top \mathbf{n}^l + b$ (12)

where constants $\mathbf{a} \in \mathbb{R}^N$ and $b$ can be computed via a Taylor series expansion of (9) at the region's center $\mathbf{c}$, where $c_i = y^l_i - x^l_i$. See the 1D case in Fig. 2 for an illustration. For reasonably small $\delta$, this is a good approximation. We now rewrite (10) as

$\Pr(\mathbf{y}^l \mid \mathbf{x}^l) \approx \int_{R(\mathbf{x}^l, \mathbf{y}^l)} \left(\mathbf{a}^\top \mathbf{n}^l + b\right) d\mathbf{n}^l$ (13)
$= \delta^N \left(\mathbf{a}^\top \mathbf{c} + b\right)$ (14)

where (14) is proven in the Appendix.
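The quality of the affine approximation (12)-(14) is easy to check numerically in 1D: integrating the Taylor expansion at the bin center over the bin gives the bin width times the pdf value at the center (the linear term integrates to zero), which can be compared against the exact Gaussian integral via the error function. The bin center, step size, and noise level below are illustrative.

```python
import math

def exact_bin_prob(c, delta, sigma):
    """Exact integral of the N(0, sigma^2) pdf over [c - delta/2, c + delta/2]."""
    z = lambda t: t / (sigma * math.sqrt(2.0))
    return 0.5 * (math.erf(z(c + delta / 2)) - math.erf(z(c - delta / 2)))

def affine_bin_prob(c, delta, sigma):
    """Approximation (13)-(14) in 1D: the affine Taylor expansion at the
    bin center integrates to (bin width) * (pdf at center)."""
    pdf_c = math.exp(-c * c / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    return delta * pdf_c

c, delta, sigma = 0.8, 0.25, 1.0   # illustrative bin center, step, noise std
exact = exact_bin_prob(c, delta, sigma)
approx = affine_bin_prob(c, delta, sigma)
assert abs(exact - approx) / exact < 0.01   # within 1% for a small bin
```

The residual error is of order $\delta^2$ relative to the bin probability, which matches the claim that the approximation is good for reasonably small $\delta$.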
3.4 Signal Prior
As done in recent graph-based image processing work [26, 27, 28], we model the similarities among pixels in $\mathbf{x}^l$ using a graph Laplacian matrix $\mathbf{L}$, and thus the prior can be written as

$\Pr(\mathbf{x}^l) \propto \exp\!\left(-\mathbf{x}^{l\top} \mathbf{L}\, \mathbf{x}^l\right)$ (15)
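The intuition behind a prior of the form (15) is that rows that are smooth with respect to the graph have a small regularizer value $\mathbf{x}^\top \mathbf{L} \mathbf{x} = \sum_{(i,j)} w_{i,j} (x_i - x_j)^2$ and hence a high prior probability. A minimal sketch on a toy 4-node path graph (edge weights and signals made up):

```python
def glr(x, edges):
    """Graph Laplacian regularizer x^T L x = sum over edges of w_ij * (x_i - x_j)^2."""
    return sum(w * (x[i] - x[j]) ** 2 for i, j, w in edges)

# toy 4-node path graph with unit edge weights
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
smooth = [10.0, 10.1, 10.2, 10.3]   # slowly varying depth row
rough  = [10.0, 12.0,  9.0, 12.5]   # wildly varying depth row

# the prior (15) assigns the smooth row a higher probability
assert glr(smooth, edges) < glr(rough, edges)
```

In the actual algorithm the edge weights are not fixed but learned from previously enhanced rows, as described in Section 4.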
We assume that the previous pixel rows in the left depth image have already been enhanced; assuming in addition that the next row follows a similar image structure, the Laplacian can be learned from the previous rows. See Section 4 for details.
3.5 MAP Formulation
We now formulate a MAP problem for $\mathbf{x}^l$ as follows:

$\hat{\mathbf{x}}^l = \arg\max_{\mathbf{x}^l} \Pr(\mathbf{y}^l, \mathbf{y}^r \mid \mathbf{x}^l, \mathbf{x}^r)\, \Pr(\mathbf{x}^l)$ (16)
$= \arg\max_{\mathbf{x}^l} \Pr(\mathbf{y}^l, \mathbf{y}^r \mid \mathbf{x}^l, \mathbf{W}(\mathbf{x}^l)\mathbf{x}^l)\, \Pr(\mathbf{x}^l)$ (17)
$= \arg\max_{\mathbf{x}^l} \Pr(\mathbf{y}^l \mid \mathbf{x}^l)\, \Pr(\mathbf{y}^r \mid \mathbf{W}(\mathbf{x}^l)\mathbf{x}^l)\, \Pr(\mathbf{x}^l)$ (18)
$\approx \arg\max_{\mathbf{x}^l} \delta^N (\mathbf{a}_l^\top \mathbf{c}_l + b_l)\, \delta^N (\mathbf{a}_r^\top \mathbf{c}_r + b_r)\, \exp\!\left(-\mathbf{x}^{l\top} \mathbf{L}\, \mathbf{x}^l\right)$ (19)

where in (17) we substituted $\mathbf{W}(\mathbf{x}^l)\mathbf{x}^l$ for $\mathbf{x}^r$ using (3), and in (18) we split up the first term since left and right noise, $\mathbf{n}^l$ and $\mathbf{n}^r$, are independent. In (19) we applied approximation (14) to each view, with per-view affine constants $(\mathbf{a}_l, b_l)$, $(\mathbf{a}_r, b_r)$ and region centers $\mathbf{c}_l = \mathbf{y}^l - \mathbf{x}^l$, $\mathbf{c}_r = \mathbf{y}^r - \mathbf{W}(\mathbf{x}^l)\mathbf{x}^l$, together with prior (15).

To ease optimization, we minimize the negative log of (19):

$\hat{\mathbf{x}}^l = \arg\min_{\mathbf{x}^l} -\log\!\left(\mathbf{a}_l^\top \mathbf{c}_l + b_l\right) - \log\!\left(\mathbf{a}_r^\top \mathbf{c}_r + b_r\right) + \mathbf{x}^{l\top} \mathbf{L}\, \mathbf{x}^l$ (20)
$= \arg\min_{\mathbf{x}^l} -\log\!\left(\mathbf{a}_l^\top (\mathbf{y}^l - \mathbf{x}^l) + b_l\right) - \log\!\left(\mathbf{a}_r^\top (\mathbf{y}^r - \mathbf{W}\mathbf{x}^l) + b_r\right) + \mathbf{x}^{l\top} \mathbf{L}\, \mathbf{x}^l$ (21)

where in (21) the view-to-view mapping is linearized by fixing $\mathbf{W}$ at the current estimate. (21) is an unconstrained convex and differentiable objective; we can solve for its minimum efficiently using FGM.
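A generic FGM loop in the style of [15] is sketched below on a toy surrogate objective (a quadratic fidelity term plus a GLR term on a 3-node path graph); the step size and toy problem are assumptions for illustration, not the paper's exact objective.

```python
import math

def fgm(grad, x0, step, iters):
    """Nesterov's fast gradient method for a smooth convex objective:
    gradient step at an extrapolated point y, then momentum update."""
    x_prev = list(x0)
    y = list(x0)
    t_prev = 1.0
    for _ in range(iters):
        x = [yi - step * gi for yi, gi in zip(y, grad(y))]
        t = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t_prev * t_prev))
        beta = (t_prev - 1.0) / t
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev, t_prev = x, t
    return x_prev

# toy surrogate: minimize ||x - y_obs||^2 + x^T L x on a 3-node path graph
y_obs = [5.0, 0.0, 5.0]
def grad(x):
    # gradient of the fidelity term plus GLR terms for edges (0,1), (1,2)
    g = [2 * (xi - yi) for xi, yi in zip(x, y_obs)]
    g[0] += 2 * (x[0] - x[1])
    g[1] += 2 * (x[1] - x[0]) + 2 * (x[1] - x[2])
    g[2] += 2 * (x[2] - x[1])
    return g

x_star = fgm(grad, x0=[0.0, 0.0, 0.0], step=0.1, iters=200)
# closed-form solution of this toy problem is (I + L)^{-1} y_obs = [3.75, 2.5, 3.75]
assert all(abs(a - b) < 1e-3 for a, b in zip(x_star, [3.75, 2.5, 3.75]))
```

The momentum term (the $\beta$-weighted extrapolation) is what separates FGM from plain gradient descent and yields the accelerated $O(1/k^2)$ convergence rate.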
4 Feature Graph Learning
4.1 Learning Metric for Graph Construction
When pixel row $\mathbf{x}^l$ of the left view is optimized, we assume that the previous rows have already been enhanced. Using these enhanced rows, we compute the graph Laplacian $\mathbf{L}$ to define the prior in (15). Because in practice only a small number of such rows is available, estimating $\mathbf{L}$ reliably using only signal observations is a known difficult small-data learning problem. In particular, established graph learning algorithms such as graphical LASSO [29] and constrained $\ell_1$-norm minimization (CLIME) [30], which compute a sparse precision matrix using as input an accurate empirical covariance matrix estimated from a large number of observations, do not work in our small-data learning scenario.
Instead, inspired by [7], we construct an appropriate similarity graph via metric learning. We first assume that associated with each pixel (graph node) $i$ in $\mathbf{x}^l$ is a length-$K$ relevant feature vector $\mathbf{f}_i \in \mathbb{R}^K$ (to be discussed). The feature distance $d(i,j)$ between two nodes $i$ and $j$ is computed using a real, symmetric and PD metric matrix $\mathbf{M}$ as

$d(i,j) = (\mathbf{f}_i - \mathbf{f}_j)^\top \mathbf{M}\, (\mathbf{f}_i - \mathbf{f}_j)$ (22)

Since $\mathbf{M}$ is PD, $d(i,j) > 0$ for $\mathbf{f}_i \neq \mathbf{f}_j$. The edge weight $w_{i,j}$ between nodes $i$ and $j$ is then computed using a Gaussian kernel:

$w_{i,j} = \exp\!\left(-d(i,j)\right)$ (23)
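Equations (22)-(23) amount to a Mahalanobis-style distance followed by a Gaussian kernel; a small sketch with made-up features and metric matrix:

```python
import math

def feature_distance(fi, fj, M):
    """Metric distance (22): (fi - fj)^T M (fi - fj)."""
    d = [a - b for a, b in zip(fi, fj)]
    k = len(d)
    return sum(d[r] * M[r][c] * d[c] for r in range(k) for c in range(k))

def edge_weight(fi, fj, M):
    """Gaussian-kernel edge weight (23) from the learned feature distance."""
    return math.exp(-feature_distance(fi, fj, M))

M = [[1.0, 0.2], [0.2, 2.0]]                  # hypothetical symmetric PD metric
f1, f2, f3 = [0.0, 0.0], [0.1, 0.1], [1.0, 1.0]
# nodes with similar features receive a larger edge weight
assert edge_weight(f1, f2, M) > edge_weight(f1, f3, M)
```

Learning $\mathbf{M}$ thus amounts to learning how much each feature dimension (and each pair of dimensions, via the off-diagonal terms) should count towards pixel similarity.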
To optimize $\mathbf{M}$, we minimize the graph Laplacian regularizer (GLR) evaluated using the $T$ previously enhanced pixel rows $\hat{\mathbf{x}}^1, \ldots, \hat{\mathbf{x}}^T$:

$\min_{\mathbf{M} \succ 0} \; \sum_{t=1}^{T} \hat{\mathbf{x}}^{t\top} \mathbf{L}(\mathbf{M})\, \hat{\mathbf{x}}^t$ (24)
$\text{s.t.} \;\; \mathrm{tr}(\mathbf{M}) \le C$ (25)

where the edge weights in Laplacian $\mathbf{L}(\mathbf{M})$ are computed using features $\mathbf{f}^t_i$ and $\mathbf{f}^t_j$ of the $t$-th observation and equations (22) and (23), and the trace constraint (25) rules out trivial solutions. To optimize (24), [7] proposed a fast optimization algorithm that optimizes the diagonal and off-diagonal entries of $\mathbf{M}$ alternately. See [7] for details.
4.2 Feature Selection for Metric Learning
To construct a feature vector $\mathbf{f}_i$ for each pixel $i$ in $\mathbf{x}^l$, we first compute the pixel's corresponding surface normal by projecting it to 3D space and computing the normal from its neighboring points via the method in [31]. Then, together with the depth value and the pixel's location in the 2D grid, we construct $\mathbf{f}_i$. Because $\mathbf{M}$ is symmetric, the number of matrix entries we need to estimate is only $K(K+1)/2$.
5 Experiments


Table 1: C2C and C2P errors of competing methods on five Middlebury sequences, for noise variances 50, 70 and 90. For each method, the first row lists C2C errors and the second row lists C2P errors.

Noise  Method      Adirondack   ArtL   Teddy  Recycle  Playtable
50     APSS            3.63     3.47    2.76    3.92      4.08
                      14.45    11.73    7.26   15.79     17.42
       RIMLS           3.47     3.35    2.67    3.72      4.09
                      13.26    11.21    7.09   15.11     17.06
       MRPCA           2.91     3.05    2.55    3.17      3.21
                       8.86     8.73    6.21   10.21      9.17
       Proposed        2.11     2.26    1.56    2.45      3.09
                       4.88     6.34    2.79    7.00      8.78
70     APSS            4.12     3.80    3.09    4.34      4.46
                      18.56    13.88    8.96   20.07     18.91
       RIMLS           3.83     3.67    3.00    4.16      4.38
                      17.26    13.41    8.73   19.42     18.92
       MRPCA           3.42     3.45    2.89    3.76      3.48
                      12.57    11.39    7.98   14.80     11.00
       Proposed        2.32     2.48    1.68    2.68      3.26
                       5.97     7.64    3.32    8.47     10.43
90     APSS            4.40     4.28    3.38    4.80      4.91
                      21.97    17.07   11.08   25.11     26.18
       RIMLS           4.19     4.13    3.30    4.59      4.83
                      21.15    16.52   10.70   24.16     23.46
       MRPCA           3.78     3.91    3.20    4.20      3.95
                      16.11    14.07    9.69   19.10     14.52
       Proposed        2.47     2.70    1.84    2.92      3.45
                       6.95     9.15    4.08   10.45     13.22

We conducted simulations with five depth image pairs provided in the Middlebury datasets [32]: Adirondack, Recycle, Playtable, Teddy and ArtL. By projecting the left and right views to 3D space, the first three generate PCs with around 700,000 points, Teddy with 337,500 points and ArtL with 192,238 points. Gaussian noise with zero mean and variance 50, 70 or 90 is added to both the left and right views, which are then quantized into 256 distinct values. To compute the precision matrix for the noise in a pixel row, we use noise estimates from previously enhanced rows to compute an empirical covariance matrix. When learning the metric for graph construction, we use the previously enhanced pixel rows. To reduce computation complexity, the same optimized metric is reused for the next several pixel rows. Based on the feature vectors of the current row, we finally compute the corresponding Laplacian.
Our proposed 3D PC enhancement method is compared against three existing PC denoising algorithms: APSS [16], RIMLS [17] and the moving robust principal component analysis (MRPCA) algorithm [18]. APSS and RIMLS are implemented in MeshLab software, and the source code of MRPCA is provided by the authors. Two commonly used PC evaluation metrics, the point-to-point (C2C) error and the point-to-plane (C2P) error between the ground truth and denoised point sets, are employed.
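For reference, the C2C metric can be sketched as a brute-force nearest-neighbor average (the C2P variant additionally projects each residual onto the local surface normal); the toy point clouds below are illustrative, and a real evaluation would use a k-d tree instead of brute force.

```python
import math

def c2c_error(ground_truth, denoised):
    """Point-to-point (C2C) error: mean distance from each ground-truth
    point to its nearest neighbor in the denoised cloud (brute force)."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return sum(min(dist(p, q) for q in denoised)
               for p in ground_truth) / len(ground_truth)

gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
noisy = [(0.0, 0.0, 0.5), (1.0, 0.5, 0.0)]
assert abs(c2c_error(gt, noisy) - 0.5) < 1e-9   # each point is 0.5 away
assert c2c_error(gt, gt) == 0.0                  # identical clouds: zero error
```

Lower values thus mean the denoised cloud hugs the ground-truth geometry more closely, which is how Table 1 should be read.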
After projecting both the noise-corrupted and quantized left and right views into a PC, one can employ the three aforementioned PC denoising algorithms. C2C and C2P results of the different methods at three noise levels are shown in Table 1. Overall, our method achieves by far the best performance in both metrics at all three noise levels, with C2C reduced by 0.68, 0.92 and 1.13, and C2P reduced by 2.68, 4.38 and 5.93 on average compared to the second-best algorithm for noise variances 50, 70 and 90, respectively.
Visual results for Recycle are shown in Fig. 3. For better visualization, we use CloudCompare software to show the C2C absolute distances between the ground-truth points and their closest denoised points. We observe that our proposed method achieves smaller C2C errors (in blue) compared to the competitors.
6 Conclusion
Point clouds are typically synthesized from finite-precision depth measurements that are noise-corrupted. In this paper, we improve the quality of a synthesized point cloud by jointly enhancing multiview depth images, the "rawest" signal we can acquire from an off-the-shelf sensor, prior to modules in a typical point cloud synthesis pipeline that obscure acquisition noise. We formulate a graph-based MAP optimization that specifically targets an image formation model accounting for both additive noise and quantization. Simulation results show that our proposed scheme outperforms competing schemes that denoise point clouds after the synthesis pipeline.
Appendix A Proof of Multiple Integral
We prove (14) by induction on the signal dimension $N$. Consider first the base case $N = 1$, where $f(n) \approx a n + b$ and the integration region is the interval $[c - \delta/2,\, c + \delta/2]$. The integral in (10) is then a single integral, and one can easily check that

$\int_{c - \delta/2}^{c + \delta/2} (a n + b)\, dn = \delta\, (a c + b).$

Consider next the inductive case: assume that for dimension $N - 1$,

$\int_{\tilde{R}} \left(\tilde{\mathbf{a}}^\top \tilde{\mathbf{n}} + b\right) d\tilde{\mathbf{n}} = \delta^{N-1} \left(\tilde{\mathbf{a}}^\top \tilde{\mathbf{c}} + b\right),$

where $\tilde{\mathbf{a}}$ and $\tilde{\mathbf{c}}$ are vectors for only the first $N - 1$ terms. If the dimension of the signal is actually $N$, then when integrating over the first $N - 1$ variables, the $N$-th term $a_N n_N$ is treated the same as the constant $b$; thus,

$\int_{R} \left(\mathbf{a}^\top \mathbf{n} + b\right) d\mathbf{n} = \int_{c_N - \delta/2}^{c_N + \delta/2} \delta^{N-1} \left(\tilde{\mathbf{a}}^\top \tilde{\mathbf{c}} + a_N n_N + b\right) dn_N.$

Since $\delta^{N-1} (\tilde{\mathbf{a}}^\top \tilde{\mathbf{c}} + b)$ is constant in $n_N$, like the base case one can easily integrate this, resulting in

$\delta^{N} \left(\mathbf{a}^\top \mathbf{c} + b\right),$

where $\mathbf{a}$ and $\mathbf{c}$ are vectors for all $N$ terms, which completes the induction.
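The closed form can also be checked numerically: the midpoint rule is exact for affine integrands, so a midpoint Riemann sum over the cube should match $\delta^N (\mathbf{a}^\top \mathbf{c} + b)$ to floating-point precision. The values of $\mathbf{a}$, $b$, $\mathbf{c}$ and $\delta$ below are arbitrary.

```python
def affine_box_integral_numeric(a, b, c, delta, grid=4):
    """Midpoint Riemann sum of f(n) = a^T n + b over the axis-aligned cube
    of side delta centered at c; exact for affine f up to float error."""
    N = len(a)
    h = delta / grid
    total = 0.0
    idx = [0] * N               # odometer over the grid^N sub-cells
    while True:
        # midpoint of the current sub-cell
        n = [c[k] - delta / 2 + (idx[k] + 0.5) * h for k in range(N)]
        total += (sum(ak * nk for ak, nk in zip(a, n)) + b) * h ** N
        k = 0
        while k < N and idx[k] == grid - 1:
            idx[k] = 0
            k += 1
        if k == N:
            break
        idx[k] += 1
    return total

a, b, c, delta = [0.3, -1.2], 2.0, [0.5, -0.4], 0.25
closed_form = delta ** len(a) * (sum(ak * ck for ak, ck in zip(a, c)) + b)
numeric = affine_box_integral_numeric(a, b, c, delta)
assert abs(numeric - closed_form) < 1e-12   # identity (14) holds exactly
```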
References
 [1] M. Wien, J. M Boyce, T. Stockhammer, and W.H. Peng, “Standardization status of immersive video coding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 1, pp. 5–17, 2019.
 [2] S. M. LaValle, Virtual Reality, Cambridge University Press: Cambridge, UK, 2016.
 [3] R. Hartley and A. Zisserman, Multiple view geometry in computer vision, Cambridge university press, 2003.

 [4] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang, "The ApolloScape dataset for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 954–960.
 [5] C. Dinesh, G. Cheung, I. V. Bajić, and C. Yang, "Local 3D point cloud denoising via bipartite graph approximation & total variation," in 2018 IEEE International Workshop on Multimedia Signal Processing. IEEE, 2018, pp. 1–6.
 [6] J. Zeng, G. Cheung, M. Ng, J. Pang, and C. Yang, “3D point cloud denoising using graph laplacian regularization of a low dimensional manifold model,” IEEE Transactions on Image Processing, vol. 29, pp. 3474–3489, 2019.
 [7] W. Hu, X. Gao, G. Cheung, and Z. Guo, "Feature graph learning for 3D point cloud denoising," arXiv preprint arXiv:1907.09138, 2019.
 [8] A. Punnappurath and M. S. Brown, "Learning raw image reconstruction-aware deep image compressors," IEEE Transactions on Pattern Analysis and Machine Intelligence, Early Access, 2019.
 [9] R. M. Nguyen and M. S. Brown, "Raw image reconstruction using a self-contained sRGB-JPEG image with only 64 KB overhead," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1655–1663.

 [10] S. Farsiu, M. Elad, and P. Milanfar, "Multiframe demosaicing and super-resolution of color images," IEEE Transactions on Image Processing, vol. 15, no. 1, pp. 141–159, 2005.
 [11] D. Sahu, A. Bhargava, and P. Badal, "Contrast image enhancement using various approaches: A review," Journal of Image Processing & Pattern Recognition Progress, vol. 4, no. 3, pp. 39–45, 2017.
 [12] A. Ortega, P. Frossard, J. Kovacevic, J. M. F. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges, and applications,” in Proceedings of the IEEE, May 2018, vol. 106, no.5, pp. 808–828.
 [13] G. Cheung, E. Magli, Y. Tanaka, and M. Ng, “Graph spectral image processing,” in Proceedings of the IEEE, May 2018, vol. 106, no.5, pp. 907–930.
 [14] J. Pang and G. Cheung, “Graph Laplacian regularization for inverse imaging: Analysis in the continuous domain,” in IEEE Transactions on Image Processing, April 2017, vol. 26, no.4, pp. 1770–1785.
 [15] Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.
 [16] G. Guennebaud and M. Gross, “Algebraic point set surfaces,” in ACM SIGGRAPH 2007 papers, pp. 23–es. 2007.
 [17] A. C. Öztireli, G. Guennebaud, and M. Gross, "Feature preserving point set surfaces based on non-linear kernel regression," in Computer Graphics Forum. Wiley Online Library, 2009, vol. 28, pp. 493–501.
 [18] E. Mattei and A. Castrodad, "Point cloud denoising via moving RPCA," in Computer Graphics Forum. Wiley Online Library, 2017, vol. 36, pp. 123–137.
 [19] D. Tian, H. Ochimizu, C. Feng, R. Cohen, and A. Vetro, “Geometric distortion metrics for point cloud compression,” in 2017 IEEE International Conference on Image Processing. IEEE, 2017, pp. 3460–3464.
 [20] W. Hu, G. Cheung, and M. Kazui, “Graphbased dequantization of blockcompressed piecewise smooth images,” in IEEE Signal Processing Letters, February 2016, vol. 23, no.2, pp. 242–246.
 [21] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang, “Learning dynamic guidance for depth image enhancement,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3769–3778.
 [22] J. Jeon and S. Lee, “Reconstructionbased pairwise depth dataset for depth image enhancement using CNN,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 422–438.
 [23] P. Wan, G. Cheung, P. Chou, D. Florencio, C. Zhang, and O. Au, “Precision enhancement of 3D surfaces from compressed multiview depth maps,” in IEEE Signal Processing Letters, October 2015, vol. 22, no.10, pp. 1676–1680.
 [24] C. Loop and Z. Zhang, “Computing rectifying homographies for stereo vision,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149). IEEE, 1999, vol. 1, pp. 125–131.
 [25] J. Jin, A. Wang, Y. Zhao, C. Lin, and B. Zeng, “Regionaware 3D warping for DIBR,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 953–966, 2016.
 [26] V. Kalofolias, “How to learn a graph from smooth signals,” in Artificial Intelligence and Statistics, 2016, pp. 920–929.
 [27] H. E Egilmez, E. Pavez, and A. Ortega, “Graph learning from data under laplacian and structural constraints,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 825–841, 2017.
 [28] Y. Bai, G. Cheung, X. Liu, and W. Gao, “Graphbased blind image deblurring from a single photograph,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1404–1418, 2018.
 [29] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” in Biostatistics, 2008, vol. 9, no.3, pp. 432–441.
 [30] T. Cai, W. Liu, and X. Luo, “A constrained minimization approach to sparse precision matrix estimation,” in Journal of the American Statistical Association, 2011, vol. 106, pp. 594–607.
 [31] H. Avron, A. Sharf, C. Greif, and D. Cohen-Or, "ℓ1-sparse reconstruction of sharp point set surfaces," ACM Transactions on Graphics (TOG), vol. 29, no. 5, pp. 1–12, 2010.
 [32] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling, “Highresolution stereo datasets with subpixelaccurate ground truth,” in German conference on pattern recognition. Springer, 2014, pp. 31–42.