Three-dimensional shape acquisition of highly dynamic and deformable objects is an increasingly active research topic in computer vision with the development of high-speed 3D video sensors [1, 2]. It is a fundamental and critical prerequisite of numerous applications, such as dynamic face recognition, action and behavior perception [4, 5], object deformation analysis, etc. However, the 3D sequences from high-speed 3D video sensors usually suffer from serious spatial noise and temporal fluctuations that degrade the performance of 3D reconstruction. The inaccuracy of the high frame rate 3D sequence is caused by multiple general factors, including calibration error, non-uniform illumination, and the surface properties and motion of the scenes or objects. Additionally, owing to the sensor technology, a small number of out-of-sync pixels produce spatial noise and temporal fluctuations in the 3D sequence. Therefore, denoising high frame rate 3D/depth sequences, and thus improving the performance of 3D dynamic and deformable shape acquisition, is of significant value.
3D/depth noise characterization and models [6, 7, 8, 9, 10] provide an important basis for boosting 3D reconstruction performance. Noise in a 3D/depth image can generally be characterized into three types (spatial, temporal and interference noise) with corresponding theoretical and empirical noise models. Existing 3D/depth image improvement methods mainly focus on reducing spatial axial and lateral noise, smoothing temporal fluctuations and filling non-measured pixels. They operate either on a single image (adaptive Gaussian filter (Ad-GF), adaptive bilateral filter (Ad-BF)) or on multiple registered images (KinectFusion, imaging burst). The multi-view 3D registration based methods [12, 14] help to smooth the 3D data and thus improve the 3D reconstruction quality, but their performance on dynamic or deformable objects is still limited. To address this, some algorithms use motion/temporal information in point-based fusion or filtering, such as the velocity-based adaptive threshold filter (Ad-TF), the spatial-temporal divisive normalized bilateral filter (DNBF), and constrained temporal averaging (TA). Those algorithms improve the 3D reconstruction of dynamic scenes but rely only on depth information. On the other hand, depth-intensity based 3D/depth noise reduction methods (adaptive joint bilateral filter (Ad-JBF), guided filter, non-causal spatio-temporal median filter (ST-MF)) and multi-sensor systems have been used for boosting the quality of 3D reconstruction. However, due to the limited reconstruction quality of high-speed 3D video sensors, denoising high frame rate sequences is still an open issue.
In this paper, we focus on intensity tracking guided 4D fusion for boosting the 3D reconstruction from high-speed 3D video sensors. The core idea behind the method is that the intensity data of consecutive images can be aligned by a temporal "stereo" matching algorithm, and then the corresponding 3D point data can be fused in the spatio-temporal domain to reduce the 3D data noise and fluctuations. Our contributions are:
(1) a generic intensity tracking guided multi-frame 4D fusion model that integrates spatial intra-frame filtering and temporal inter-frame fusion. (Sec. II)
(2) a simple yet powerful pipeline for boosting the 3D reconstruction of dynamic and deformable objects. (Sec. III)
(3) a demonstration of both by denoising 3D sequences of stationary, dynamic and deformable objects from a high frame rate 3D video sensor. (Sec. IV)
II Proposed Pipeline
The proposed system framework (Fig. 1) has two main stages: (1) intensity tracking guided 3D motion field estimation; (2) spatio-temporal multi-frame 4D fusion. Given a 3D sequence with pixel-wise registered intensity images and depth images, in the first stage dense tracking is performed on the intensity sequence using a belief propagation based patch matching algorithm for optical flow, which yields continuous intensity motion fields. Then, using the projective camera model, the pixel-wise 3D motion fields of the registered 3D sequence are estimated by leveraging the intensity motion fields. In the second stage, piecewise spatio-temporal multi-frame 4D fusion is performed on the 3D sequence using the 3D motion fields. Since the rejected outliers in the 3D motion fields result in holes in the fused 3D sequence, we perform gradient-directed hole filling to repair them. Finally, an improved 3D sequence is obtained. More details are given in Section III.
III Intensity Tracking Guided 4D Fusion
This section details the intensity tracking guided 3D motion field estimation and the spatio-temporal multi-frame 4D fusion model.
III-A Intensity-guided 3D Motion Field Estimation
For a dynamic 3D object, we assume that each intensity image point is trackable across consecutive frames in the temporal domain. Dense tracking is performed on the pixel-wise registered intensity sequence using a particle belief propagation method, which gives a motion field between each pair of consecutive 2D intensity frames. Because the intensity and depth frames are pixel-wise registered, these continuous intensity motion fields also give pixel-wise correspondences between the registered depth frames. We chain the correspondences from frame to frame so that each point has a known position in every 3D frame.
The intensity correspondence field is obtained by minimizing an objective function that combines a unary term evaluating point similarity with a pairwise term enforcing piecewise smoothness. The unary term represents the discrepancy between a pair of corresponding 2D intensity patches centered on the corresponding points in consecutive frames. The pairwise term is a smoothness term defined over the neighbors of each 2D intensity pixel within a frame; it regularizes the correspondence field and is optimized by minimizing the message (smoothness error) passed from each neighboring intensity patch to the current patch.
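A unary-plus-pairwise objective of this kind is commonly written in the following generic form. This is a sketch only, not necessarily the exact equation used by the method: the symbols $\mathbf{u}_s$ (correspondence vector at pixel $s$), $N(s)$ (neighbors of $s$), $\psi_s$ (unary patch-discrepancy term) and $\psi_{st}$ (pairwise smoothness term) are assumptions introduced here for illustration.

```latex
E(\mathbf{u}) \;=\; \sum_{s} \psi_s(\mathbf{u}_s)
\;+\; \sum_{s} \sum_{t \in N(s)} \psi_{st}(\mathbf{u}_s, \mathbf{u}_t)
```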
According to the projective camera model, each point in a 3D frame can be expressed in terms of the calibration parameters of the camera (focal length and optical center), the depth value of the point, and its intensity pixel coordinates.
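The back-projection can be sketched with a standard pinhole model. This is a minimal illustration, assuming separate focal lengths `fx`, `fy` and an optical center `(cx, cy)` in pixels; the function name and parameter names are hypothetical and not taken from the paper.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) to per-pixel 3D points (H, W, 3).

    Assumes a pinhole camera: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # v: row index, u: column index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)
```

Because the intensity and depth images are pixel-wise registered, the same `(u, v)` grid indexes both the intensity image and the resulting 3D points.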
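With the 3D points of two consecutive frames and the intensity motion field between them, the 3D correspondence vector of a point from one frame to the next can be estimated as the difference between its 3D position in the next frame, taken at the pixel indicated by the intensity motion field, and its 3D position in the current frame. By tracking from frame to frame, we can link each intensity image point to its 3D position in all the frames.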
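A minimal sketch of this step is given below. It assumes the 3D points of both frames are stored as `(H, W, 3)` arrays, that the intensity motion field is a `(H, W, 2)` array of `(du, dv)` pixel displacements, and that displacements are rounded to the nearest pixel; the function name and array layout are hypothetical.

```python
import numpy as np

def motion_field_3d(points_t, points_t1, flow):
    """Estimate per-pixel 3D motion vectors from a 2D intensity motion field.

    points_t, points_t1: float arrays (H, W, 3) for frames t and t+1.
    flow: array (H, W, 2) holding (du, dv) from frame t to frame t+1.
    Returns (motion, valid); motion is NaN where the target leaves the image.
    """
    h, w, _ = points_t.shape
    v, u = np.mgrid[0:h, 0:w]
    u1 = np.rint(u + flow[..., 0]).astype(int)   # target column in frame t+1
    v1 = np.rint(v + flow[..., 1]).astype(int)   # target row in frame t+1
    valid = (u1 >= 0) & (u1 < w) & (v1 >= 0) & (v1 < h)
    u1c = np.clip(u1, 0, w - 1)
    v1c = np.clip(v1, 0, h - 1)
    motion = points_t1[v1c, u1c] - points_t      # 3D displacement per pixel
    motion[~valid] = np.nan
    return motion, valid
```

Chaining the returned vectors over successive frame pairs gives the integrated motion of a point across several frames.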
III-B Spatio-temporal Multi-frame 4D Fusion
Given consecutive 3D frames, we seek to fuse them into one frame using the continuous 3D motion fields for piecewise spatio-temporal smoothness. First, the outliers in each 3D motion field are removed by verifying pairwise forward and backward motion vectors against a threshold constraint. Specifically, for a forward 3D motion vector and the corresponding backward 3D motion vector between a pair of corresponding points in two frames, the magnitude of the sum of the two vectors should be smaller than a threshold (specified in pixels in practice), since a consistent pair of vectors cancels out. The 3D motion vectors that satisfy the threshold constraint are accepted as reasonable motion vectors.
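The forward-backward check can be sketched as follows. This is an illustrative version, assuming the backward vectors have already been sampled at the forward-mapped positions so that both arrays are aligned per pixel; the function name and the Euclidean norm are assumptions.

```python
import numpy as np

def consistent_mask(forward, backward_at_target, threshold):
    """Flag motion vectors whose forward and backward estimates agree.

    forward: (H, W, 3) vectors from frame i to frame j.
    backward_at_target: (H, W, 3) vectors from frame j back to frame i,
        sampled at the points reached by the forward vectors.
    A consistent pair sums to (nearly) zero; NaN entries are rejected.
    """
    residual = np.linalg.norm(forward + backward_at_target, axis=-1)
    return residual < threshold
```

Pixels where the mask is `False` are treated as outliers and later become holes in the fused frame.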
The piecewise spatio-temporal 4D fusion performed on consecutive 3D frames combines an internal (intra-frame) summation with an external (inter-frame) summation. The internal summation runs over a set of spatial neighbors of the tracked point in a neighboring frame, with Gaussian weights assigned according to the spatial distance and the intensity difference. The intensity-guided weights contribute to the spatial smoothness of the 3D frame, which reduces 3D noise while preserving some geometric structure information. This internal summation computes a bilaterally smoothed point in the neighboring frame, which is then mapped back to the reference frame using the integrated motion vectors. The external summation runs over a set of neighboring frames of the reference frame. Each of its terms carries a flag for the validity of the integrated 3D motion vector from the reference frame to the neighboring frame and a weight assigned according to the temporal distance. Two normalization factors are applied, one for inter-frame fusion and one for intra-frame filtering. Eq. (5) gives a smoothed 3D point in the reference frame. Overall, both the spatial and temporal piecewise smoothness are guided by the 2D intensity information.
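The two nested summations can be sketched for a single tracked point as below. This is a simplified illustration, not the authors' implementation: the function names, the Gaussian temporal weight, the default parameter values, and the data layout (per-frame point arrays, per-frame tracked pixel positions, and integrated motion vectors from the reference frame) are all assumptions.

```python
import numpy as np

def bilateral_point(points, intensity, v, u, radius, sigma_s, sigma_i):
    """Intra-frame step: intensity-guided bilateral average around (v, u)."""
    h, w = intensity.shape
    v0, v1 = max(v - radius, 0), min(v + radius + 1, h)
    u0, u1 = max(u - radius, 0), min(u + radius + 1, w)
    vv, uu = np.mgrid[v0:v1, u0:u1]
    w_s = np.exp(-((vv - v) ** 2 + (uu - u) ** 2) / (2 * sigma_s ** 2))
    d_i = intensity[v0:v1, u0:u1] - intensity[v, u]
    w_i = np.exp(-(d_i ** 2) / (2 * sigma_i ** 2))
    wgt = w_s * w_i
    return (wgt[..., None] * points[v0:v1, u0:u1]).sum(axis=(0, 1)) / wgt.sum()

def fuse_point(frames, intensities, tracks, motions, valid, ref,
               radius=1, sigma_s=1.0, sigma_i=0.1, sigma_t=2.0):
    """Inter-frame step: fuse one tracked point into frame `ref`.

    frames[j]: (H, W, 3) points; intensities[j]: (H, W) image.
    tracks[j]: (v, u) pixel of the tracked point in frame j.
    motions[j]: integrated 3D motion vector from frame ref to frame j.
    valid[j]: validity flag of motions[j]; valid[ref] is assumed True.
    """
    num, den = np.zeros(3), 0.0
    for j in range(len(frames)):
        if not valid[j]:
            continue                      # rejected motion vector: skip frame
        v, u = tracks[j]
        smoothed = bilateral_point(frames[j], intensities[j], v, u,
                                   radius, sigma_s, sigma_i)
        w_t = np.exp(-((j - ref) ** 2) / (2 * sigma_t ** 2))
        num += w_t * (smoothed - motions[j])   # map back to the reference frame
        den += w_t
    return num / den
```

Skipping frames with an invalid motion vector plays the role of the validity flag, and dividing by the accumulated weights plays the role of the normalization factors.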
A 3D point without valid spatial or temporal neighbors is filled with a point interpolated from its spatial neighboring 3D points. The interpolation is taken over a set of spatial neighbors of the missing point and uses, for each neighbor, a Gaussian weight assigned according to the spatial distance and the gradient at the neighboring point, together with a normalization factor.
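One possible reading of gradient-directed hole filling is sketched below on a depth map: each valid neighbor predicts the missing depth by extrapolating along its own local gradient, and the predictions are combined with Gaussian spatial weights. The exact formulation of the method may differ; the function name, the use of `np.gradient`, and the default parameters are assumptions.

```python
import numpy as np

def fill_hole(depth, hole_mask, v, u, radius=2, sigma_s=1.0):
    """Estimate the depth at hole pixel (v, u) from its valid neighbors.

    Each neighbor q contributes depth(q) + grad(q) . (p - q), weighted by a
    Gaussian of the pixel distance. Neighbors whose depth or gradient is
    undefined (because they touch a hole) are ignored.
    """
    z = np.where(hole_mask, np.nan, depth.astype(float))
    gv, gu = np.gradient(z)               # gradients along rows and columns
    h, w = z.shape
    v0, v1 = max(v - radius, 0), min(v + radius + 1, h)
    u0, u1 = max(u - radius, 0), min(u + radius + 1, w)
    vv, uu = np.mgrid[v0:v1, u0:u1]
    pred = (z[v0:v1, u0:u1]
            + gv[v0:v1, u0:u1] * (v - vv)
            + gu[v0:v1, u0:u1] * (u - uu))
    wgt = np.exp(-((vv - v) ** 2 + (uu - u) ** 2) / (2 * sigma_s ** 2))
    ok = np.isfinite(pred)
    if not ok.any():
        return float("nan")               # no usable neighbor in the window
    return float((wgt[ok] * pred[ok]).sum() / wgt[ok].sum())
```

With this choice a hole on a locally planar surface is filled exactly, which is the motivation for using the gradient rather than a plain weighted average.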
As a result, we can obtain a fused 3D sequence with lower spatial noise and temporal fluctuations.
IV Results and Discussion
This section presents noise and shape correctness tests on synthetic data and real experiments using a high frame rate 3D sensor to verify the effectiveness and robustness of the proposed algorithm.
IV-A Synthetic Noise Test
The synthetic measured object is a falling 3D ball with the radius of mm. The synthetic 3D sequence contains 50 3D frames. The resolutions of the intensity image and depth image are pixels and points respectively. The sphere fell with the speed of pixels/frame. The roughness of the 3D surface in one frame was measured by averaging, over the central area of the sphere, the local roughness of each 3D point relative to its neighboring patch with the size of points. The local roughness is computed from the neighboring points in the window around the central point and the normal vector of the plane fitted to those neighboring points. Note that this form of roughness measure is not zero when there is no noise, due to the curvature of the surface. We used the roughness to evaluate the performance because there is no ground truth for the real data experiments and we wanted to be able to compare the simulated and real results using the same measure.
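A local roughness of this kind can be sketched as the mean absolute offset of the neighboring points from the central point, measured along the normal of a least-squares plane fitted to the neighbors. This is an assumed form for illustration; the function name and the SVD-based plane fit are not taken from the paper.

```python
import numpy as np

def local_roughness(center, neighbors):
    """Mean absolute offset of neighbors from `center` along the plane normal.

    center: (3,) central 3D point; neighbors: (N, 3) patch points, N >= 3.
    The plane normal is the direction of least variance of the patch.
    """
    neighbors = np.asarray(neighbors, dtype=float)
    centered = neighbors - neighbors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]                        # unit normal of the fitted plane
    offsets = (neighbors - np.asarray(center, dtype=float)) @ normal
    return float(np.abs(offsets).mean())
```

On a noise-free plane this measure is zero, while on a noise-free sphere it is small but positive, which matches the remark about surface curvature.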
We added Gaussian random noise with varying noise levels to the intensity and depth images, respectively, and then calculated the mean roughness of the reconstructed 3D sequence. The depth noise level varies from 0.1 mm to 0.4 mm. The intensity values are normalized to [0, 1] and the intensity noise level varies from 2% to 10% of the highest intensity value. The results were compared with existing methods including Ad-GF, Ad-BF, guided filter, DNBF, TA, Ad-JBF, and ST-MF. The mean roughness results (over all frames) w.r.t. different noise levels and algorithms are shown in Fig. 2.
The results in Fig. 2a demonstrate that the performance of the proposed algorithm is superior to other algorithms especially at higher depth noise levels. Some intensity-joint or motion-joint algorithms (Ad-JBF, Guided filtering, DNBF) achieve better results on the synthetic noisy 3D ball than the single image based algorithms such as Ad-BF and Ad-GF. In Fig. 2b, our algorithm has better performance over all the intensity noise levels, followed by the Guided filtering and Ad-JBF. Specifically, for our algorithm, the increments of the mean roughness in lower intensity noise levels are smaller than those in higher levels. This is because the 3D motion vectors are quantized to integral points and some sub-point wrong motion vectors are rejected at the stage of 3D motion field estimation, which increases the robustness of the intensity guided fusion method to some extent.
IV-B Roughness vs. Shape Correctness Test
Roughness and shape correctness are important coupled parameters for describing the quality of 3D reconstructed data. We seek to improve the smoothness of 3D data without losing the shape correctness when oversmoothing happens. In this part, using the falling noisy synthetic sphere (with known ground truth), we investigated the balance between the reduction in roughness and in shape correctness of different algorithms as the amount of smoothing is varied. The results are shown in Fig. 3. The shape correctness is defined as
Fig. 3a illustrates the balance between roughness and shape correctness on the noisy ball from a side view. Our algorithm's smoothed depth values (black curve) have both lower roughness and better shape correctness than the raw values, while the DNBF smoothed depth values (red curve) have worse shape correctness at the same roughness. That means DNBF achieves its roughness improvement by sacrificing some shape correctness, which causes unexpected global deformations of the object.
For each algorithm, we varied the size of the smoothing neighborhood and the number of smoothing iterations so that the algorithms generate different roughness values, and investigated the corresponding shape correctness. The initial depth noise level is mm and the intensity noise level is 2%. The quantitative results are shown in Fig. 3b. Overall, when the different noise reduction algorithms are applied, the mean roughness decreases from the raw roughness (3.75 mm) to different degrees, with increasing shape correctness. However, after the best point, oversmoothing causes serious loss of shape correctness with almost the same or only slightly decreasing roughness. Specifically, the curves show that our proposed algorithm achieves the best performance (nearest the upper left corner), which demonstrates that it can denoise the 3D data while better preserving the structural information.
IV-C Results on High Frame Rate Sensors
The proposed method was tested on four real 3D objects with different states and surface complexities, including a static plane, a static hand, a falling ball and a speaking human face (as shown in Fig. 4a). The measured stationary plane with textures is mm. The radius of the ball is mm. The 3D sequence of the ball is time-varying since the ball deforms and rotates slightly while falling. For each object, we captured a 3D sequence using a high-speed DI4D system that consists of a high frame rate stereo video sensor. We applied the proposed method with varying numbers of fused frames to each measured object. For each number of fused frames, we calculated the mean roughness and standard deviation (std) of the 3D sequence. The results are shown in Fig. 5. A qualitative example result of the proposed method when fusing 9 frames is shown in Fig. 4, and the corresponding quantitative comparative results are shown in Table 1.
| | plane | | hand | | falling ball | | dynamic face | |
|---|---|---|---|---|---|---|---|---|
| Guided filter | 0.34 | 0.73 | 0.61 | 1.21 | 0.83 | 2.43 | 0.60 | 5.09 |
| 6D motion field (rigid ICP) | 0.39 | 0.59 | 0.78 | 0.91 | - | - | - | - |
| 6D motion field (ED) | - | - | - | - | 0.81 | 1.97 | 0.52 | 2.67 |
| Ours (9 frames) | 0.22 | 0.31 | 0.55 | 0.83 | 0.71 | 1.14 | 0.40 | 2.73 |
One can model the mean roughness presented in Fig. 5 as a function of the std of the structural noise, the std of the time-varying noise, and the number of frames fused. The red lines in Fig. 5 show that this theoretical model fits the experimental results closely. Both the mean and the std of the roughness decrease as the number of fused frames increases. Compared with the static object, the std of roughness of the dynamic object falls more sharply when the number of fused frames varies from 2 to 9. This is because the number of fused frames mainly influences the temporal dynamic noise, while the dominant noise of the static object is regular structural noise. Overall, we can conclude that the proposed intensity-guided 4D fusion algorithm is more effective and suitable for boosting the 3D reconstruction of dynamic objects.
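A model of this kind can be sketched under two assumptions that are made here for illustration only: the structural and time-varying noise are independent, and fusing N frames averages the time-varying component so that its variance scales as 1/N. The resulting form, sqrt(sigma_s^2 + sigma_t^2 / N), and both function names below are hypothetical and may differ from the model actually fitted in Fig. 5.

```python
import numpy as np

def roughness_model(n_frames, sigma_s, sigma_t):
    """Assumed model: structural noise plus averaged time-varying noise."""
    n = np.asarray(n_frames, dtype=float)
    return np.sqrt(sigma_s ** 2 + sigma_t ** 2 / n)

def fit_noise_model(n_frames, roughness):
    """Fit (sigma_s, sigma_t) by linear least squares on roughness**2 vs 1/N."""
    n = np.asarray(n_frames, dtype=float)
    r2 = np.asarray(roughness, dtype=float) ** 2
    design = np.stack([np.ones_like(n), 1.0 / n], axis=1)
    coef, *_ = np.linalg.lstsq(design, r2, rcond=None)
    coef = np.clip(coef, 0.0, None)        # variances cannot be negative
    return float(np.sqrt(coef[0])), float(np.sqrt(coef[1]))
```

Under this model the roughness of a mostly static object flattens quickly toward sigma_s, while an object dominated by time-varying noise keeps improving as more frames are fused.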
The qualitative results in Fig. 4 show that our method clearly reduces the 3D noise, so that the surfaces of the ROIs of the observed 3D objects are much smoother than those in the raw 3D images, especially for the falling ball. Correspondingly, the comparative results in Table 1 demonstrate that our method achieves the best performance with the lowest mean roughness (spatial noise) and the most stable roughness measure (std: temporal fluctuations).
IV-D Comparison with 6D Motion Field Based Fusion
In contrast to 4D fusion based on intensity motion fields for 3D/depth noise reduction, there is a group of algorithms that directly generate volumetric 6D motion fields using depth data from Kinect sensors and reconstruct improved 3D scenes via dense 3D/depth frame registration, such as KinectFusion, DynamicFusion, and 3D Deformable Scanning. In those works, the multi-view partial 2.5D scans from the Kinect sensors allow for large geometric and pose variations, while our algorithm works on consecutive frames from a 1000 fps 3D video sensor, focusing on dense micro-deformation and fusion. Besides, the 3D noise from the 1000 fps video sensor is closely related to the textures of the observed 3D objects due to the uneven reflectance of the textures, as shown in Fig. 6. Therefore, we directly use intensity information to generate intensity motion fields that guide the spatio-temporal fusion.
We compared the performance of the proposed algorithm on the same four objects with the 6D motion field based fusion algorithms. For static objects including the static plane and the hand, a 6D transformation between a pair of consecutive 3D frames was generated using the rigid ICP algorithm, then all the registered 3D points were integrated into a volumetric representation for fusion. For the dynamic and deformable objects including the falling ball and the speaking human face, a dense 6D warp field between pairwise 3D frames was generated using the Embedded Deformable model (ED) based registration method. Then, 9 consecutive frames were fused by leveraging the 8 dense flow fields between each pair of 3D frames. We calculated the roughnesses of the surface of each object and mapped them to the object as shown in Fig. 7. The mean roughness and standard deviation of all 3D frames in a sequence were calculated, as listed in Table 1.
Overall, both the qualitative results in Fig. 7 and the comparative results in Table 1 show that our algorithm achieves better results on the datasets. The use of the 2D intensity frames increases the accuracy of dense correspondence and thus improves the spatio-temporal fusion for 3D noise reduction of high frame rate 3D video sensors, especially for objects with fewer 3D shape characteristics, such as the plane, hand and ball. Also, our algorithm directly focuses on texture-related 3D noise (Fig. 6), yielding a texture correspondence guided dense 3D motion field. It is therefore more suitable for high frame rate 3D sequences of dynamic and deformable objects, even those with fewer 3D shape features.
This paper presents a simple yet powerful pipeline for improving the 3D reconstruction of dynamic and deformable objects, using 2D intensity tracking guided multi-frame 4D fusion. The continuous motion fields of a 3D sequence are estimated by leveraging the intensity motion fields that are obtained by dense tracking on a pixel-wise registered intensity sequence. Using a spatial-temporal multi-frame 4D fusion model, consecutive 3D frame fusions are performed for improving the spatial smoothness and temporal stability of the 3D sequence. The experimental results on stationary, dynamic and deforming objects verify that the proposed algorithm achieves state-of-the-art performance with the lowest mean roughness over the reconstructed 3D surface in one frame and the best robustness over the whole 3D sequence. In the future, we would like to apply the proposed algorithm as a part of dynamic 3D shape acquisition and recognition (e.g. dynamic 3D human face and hand gesture recognition) to improve the accuracy and robustness of the 3D reconstruction and recognition of highly dynamic and deformable objects.
This work is supported by the China Scholarship Council (No. 201606020087) and the National Council for Science and Technology (CONACyT) of Mexico.
[1] Y. Xiao, R. B. Fisher, and M. Oscar, "Performance characterization of a high-speed stereo vision sensor for acquisition of time-varying 3D shapes," Machine Vision and Applications, vol. 22, no. 3, pp. 535–549, 2011.
[2] S. Tabata, S. Noguchi, Y. Watanabe, and M. Ishikawa, "High-speed 3D sensing with three-view geometry using a segmented pattern," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 3900–3907.
[3] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard, "BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database," Image and Vision Computing, vol. 32, no. 10, pp. 692–706, 2014.
[4] J. Wang and Z. Xu, "STV-based video feature processing for action recognition," Signal Processing, vol. 93, no. 8, pp. 2151–2168, 2013.
[5] J. Xiang and R. Liang, "Motion recognition and synthesis based on 3D sparse representation," Signal Processing, vol. 110, pp. 82–93, 2015.
[6] T. Mallick, P. P. Das, and A. K. Majumdar, "Characterizations of noise in Kinect depth images: A review," IEEE Sensors Journal, vol. 14, no. 6, pp. 1731–1740, 2014.
[7] K. Khoshelham and S. O. Elberink, "Accuracy and resolution of Kinect depth data for indoor mapping applications," Sensors, vol. 12, no. 2, pp. 1437–1454, 2012.
[8] Y. Yu, Y. Song, Y. Zhang, and S. Wen, "A shadow repair approach for Kinect depth maps," in Asian Conference on Computer Vision. Springer, 2012, pp. 615–626.
[9] C. V. Nguyen, S. Izadi, and D. Lovell, "Modeling Kinect sensor noise for improved 3D reconstruction and tracking," in 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2012 Second International Conference on. IEEE, 2012, pp. 524–530.
[10] J.-H. Park, Y.-D. Shin, J.-H. Bae, and M.-H. Baeg, "Spatial uncertainty model for visual features using a Kinect™ sensor," Sensors, vol. 12, no. 7, pp. 8640–8662, 2012.
[11] L. Chen, H. Lin, and S. Li, "Depth image enhancement for Kinect using region growing and bilateral filter," in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 3070–3073.
[12] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison et al., "KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. ACM, 2011, pp. 559–568.
[13] S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy, "Burst photography for high dynamic range and low-light imaging on mobile cameras," ACM Transactions on Graphics (TOG), vol. 35, no. 6, p. 192, 2016.
[14] C. Zhang, S. Du, J. Liu, and J. Xue, "Robust 3D point set registration using iterative closest point algorithm with bounded rotation angle," Signal Processing, vol. 120, pp. 777–788, 2016.
[15] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb, "Real-time 3D reconstruction in dynamic scenes using point-based fusion," in 3DTV-Conference, 2013 International Conference on. IEEE, 2013, pp. 1–8.
[16] K. Essmaeel, L. Gallo, E. Damiani, G. De Pietro, and A. Dipandà, "Temporal denoising of Kinect depth data," in Signal Image Technology and Internet Based Systems (SITIS), 2012 Eighth International Conference on. IEEE, 2012, pp. 47–52.
[17] J. Fu, S. Wang, Y. Lu, S. Li, and W. Zeng, "Kinect-like depth denoising," in Circuits and Systems (ISCAS), 2012 IEEE International Symposium on. IEEE, 2012, pp. 512–515.
[18] J. Wasza, S. Bauer, and J. Hornegger, "Real-time preprocessing for dense 3-D range imaging on the GPU: defect interpolation, bilateral temporal averaging and guided filtering," in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 1221–1227.
[19] M. Camplani, T. Mantecon, and L. Salgado, "Depth-color fusion strategy for 3-D scene modeling with Kinect," IEEE Transactions on Cybernetics, vol. 43, no. 6, pp. 1560–1571, 2013.
[20] K. He, J. Sun, and X. Tang, "Guided image filtering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 6, pp. 1397–1409, 2013.
[21] S. Matyunin, D. Vatolin, Y. Berdnikov, and M. Smirnov, "Temporal filtering for depth maps generated by Kinect depth camera," in 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video (3DTV-CON), 2011. IEEE, 2011, pp. 1–4.
[22] L. Yang, L. Zhang, H. Dong, A. Alelaiwi, and A. El Saddik, "Evaluating and improving the depth accuracy of Kinect for Windows v2," IEEE Sensors Journal, vol. 15, no. 8, pp. 4275–4285, 2015.
[23] F. Besse, C. Rother, A. Fitzgibbon, and J. Kautz, "PMBP: PatchMatch belief propagation for correspondence field estimation," International Journal of Computer Vision, vol. 110, no. 1, pp. 2–13, 2014.
[24] P. H. Torr and A. Zisserman, "MLESAC: A new robust estimator with application to estimating image geometry," Computer Vision and Image Understanding, vol. 78, no. 1, pp. 138–156, 2000.
[25] "Dimensional Imaging (DI4D™)," http://www.di4d.com/.
[26] M. Dou, J. Taylor, H. Fuchs, A. Fitzgibbon, and S. Izadi, "3D scanning deformable objects with a single RGBD sensor," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 493–501.
[27] R. A. Newcombe, D. Fox, and S. M. Seitz, "DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 343–352.